Abstract


This dissertation applies reinforcement learning to the adaptive control of active
sensory-motor systems. Active sensory-motor systems, in addition to providing for
overt action, also support active, selective sensing of the environment. The
principal advantage of this active approach to perception is that the agent’s
internal representation can be made highly task specific, thus avoiding wasteful
sensory processing and the representation of irrelevant information. One
unavoidable consequence of active perception is that improper control can lead to
internal states that confound functionally distinct states in the external world.
This phenomenon, called perceptual aliasing, is shown to destabilize existing
reinforcement learning algorithms with respect to optimal control.

To overcome these difficulties, an approach to adaptive control, called the
Consistent Representation (CR) method, is developed. This method is used to
construct systems that learn not only the overt actions needed to solve a task, but
also where to focus their attention in order to collect necessary sensory
information. The principle of the CR-method is to separate control into two stages:
an identification stage, followed by an overt stage. The identification stage
generates the task-specific internal representation that is used by the overt
control stage. Adaptive identification is accomplished by a technique that involves
the detection and suppression of perceptually aliased internal states. Q-learning
is used for adaptive overt control.

The technique is then extended to include two cooperative learning mechanisms,
called Learning with an External Critic (LEC) and Learning By Watching (LBW),
respectively, which significantly improve learning. Cooperative mechanisms exploit
the presence of helpful agents in the environment to supply auxiliary sources of
trial-and-error experience and to decrease the latency between the execution and
evaluation of an action.
1  Introduction


For the better part of thirty years, research in AI has focused on high level,
cognitive aspects of intelligence. Topics like planning, problem solving, natural
language understanding, knowledge representation, and reasoning have traditionally
been at the core of AI. However, abstract thought is only a part of intelligence and
is generally useless without sensors and effectors to ground it in the real world.
For these reasons an increasing number of researchers have begun to look at some
of the more mundane aspects of intelligent behavior that can be modeled with
complete behaving systems. More and more researchers are building robots that
act in the real world instead of complex systems that act in artificial, simulated, or
disembodied worlds. This shift has had two important effects. First, it has forced
the reexamination of basic assumptions that underlie traditional approaches to
intelligent behavior, and, second, it has led to research in areas that address the
shortcomings of traditional approaches.

Two topics that are currently generating a great deal of interest are active
perception and reinforcement learning. Active perception is concerned with the
development of sensory systems that are dynamically controlled to selectively
process and represent percepts about the environment in a task-dependent way.
Reinforcement learning is concerned with the adaptive control of an agent through
the use of scalar rewards (for feedback) and direct trial-and-error interaction with
the environment.

Active perception and reinforcement learning are both important to the
development of intelligent agents. Active perception is needed for efficient,
realistic perception, and reinforcement learning is important to the development of
adaptive systems that, among other things, do not rely too heavily upon a priori
domain knowledge. To date, these two lines of research have, for the most part,
progressed independently.^1 Work on active perception has focused primarily on
understanding the benefits of active vision, on developing approaches to active
vision, and on building systems that use it. Little concern has been focused on
how an agent might actually learn to control an active visual system. Similarly,
most work in reinforcement learning has ignored perceptual issues completely by
assuming that the agent at each time step has sensory inputs that completely
describe the state of the world with respect to the task.

^1 Notable exceptions include the work described in this thesis [Whitehead and
Ballard, 1990; Whitehead and Ballard, 1991a], work by Tan on learning cost
sensitive internal representations [Tan, 1991b; Tan, 1991a], Chapman and
Kaelbling’s work on the generalization problem [Chapman and Kaelbling, 1991],
and Schmidhuber’s work on learning to control visual attention [Schmidhuber,
1990a].


1.1  Active  Perception

The vast majority of work in AI has not dealt realistically with perception.
Typically, perception is abstracted out of consideration in order to focus on more
central decision making issues. It is common to assume that a decoupled (often
implicit) sensory system provides the decision system with an internal
representation that completely and accurately describes the state of the external
world. This representation frequently takes the form of a set of propositions that
describe the relationships between, and the features of, every potentially relevant
object in the domain. Even for simple toy domains this reconstructive approach
to perception places an unrealistic burden on the sensory system and leads to
internal representations rife with irrelevant information. For example, in a real
world version of the blocks-world it is unrealistic to expect a sensory system to
analyze more than a few blocks at a time. Moreover, if there are n blocks in the
world and each is represented using traditional methods, then the size of the state
space is O(n!) [Ginsberg, 1989]. For n = 20 the state space has over forty billion
(42,949,672,940) states.

The large amount of information encoded in these representations is difficult
to deal with, but more importantly, most of it is irrelevant to the immediate
task facing the agent. It only interferes with decision making (and learning)
by clogging the system with irrelevant detail. The situation deteriorates even
further when we consider agents whose tasks are numerous, complex, and not
well understood ahead of time. Under these circumstances, complete internal
representations will necessarily have to encode even more information that is likely
to be even less useful at any given point in time. If intelligent robots are to be
achieved, then intelligent sensing strategies that balance generality and flexibility
with computational feasibility must be developed.

Active perception represents a promising approach to this challenge. The
central tenet of active perception is that an agent’s sensory system is an
information collecting resource that is at least partially under the control of the
agent’s internal decision processes. By directly controlling the allocation of sensory
processing resources, the agent selectively monitors and represents those aspects
of the world that are immediately relevant to the task at hand and ignores what is
irrelevant. A key assumption is that at any given time only a relatively small
amount of information is needed for decision making. By exploiting this
assumption, the amount of computation required for sensing reflects the complexity
of the agent’s task and not the complexity of the world in which it is embedded.
Efficiency is attained by carefully controlling the selection and application of
computational resources so that only relevant aspects of the environment are
processed, thus generating internal representations that are minimal in size and
task-specific. Flexibility and generality are attained by providing the system with
a range of sensory processing resources that can be flexibly applied to different
parts of the environment. Additional flexibility is gained when processing resources
define primitive operations that can be composed to define complex sensing
routines [Ullman, 1984].

Human vision is a perfect example of active perception. We move our eyes
to allocate our visual processing resources (e.g., foveal vision) to those aspects of
the world that are most important to us. This point has been elegantly
demonstrated by Yarbus [Yarbus, 1967], who showed that a subject’s eye movements
are task-dependent (see Figure 1.1). Yarbus’ seminal experiments on human eye
movements have been followed by considerable research in psychology
and artificial intelligence (not to mention physiology, anatomy, and neuroscience)
aimed at better understanding selective vision in man and machine. Some of
this research includes work analyzing the computational advantages of active
vision [Aloimonos et al., 1987; Ballard and Ozcandarli, 1988; Ballard, 1989a;
Ballard, 1991; Simmons, 1990; Tsotsos, 1987], work on architectures for vision and
visual sensing strategies [Agre, 1988; Bajcsy and Allen, 1984; Chapman, 1990b;
Chapman and Kaelbling, 1991; Chrisman and Simmons, 1991; Dickmanns, 1989;
Garvey, 1976; Rimey and Brown, 1990; Swain, 1990; Tan and Schlimmer, 1990;
Ullman, 1984; Romanycia, 1987; Romanycia, 1988], and aspects of active vision
in humans [Chapman, 1990a; Noton, 1970; Noton and Stark, 1971a; Noton and
Stark, 1971b; O’Regan and Levy-Schoen, 1983; Treisman and Gelade, 1980;
Ullman, 1984].


Figure 1.1: (reproduced from [Yarbus, 1967]) A reproduction of I. E. Repin’s
painting “An Unexpected Visitor” and records of seven eye movement traces for
the same subject. Each record lasted 3 minutes. The subject examined the
reproduction with both eyes. 1) Free examination of the picture. Before the
subsequent recording sessions, the subject was asked to: 2) estimate the material
circumstances of the family in the picture; 3) give the ages of the people; 4) surmise
what the family had been doing before the arrival of the “unexpected visitor”; 5)
remember the clothes worn by the people; 6) remember the position of the people
and objects in the room; 7) estimate how long the “unexpected visitor” had been
away from the family.


1.2  Reinforcement  Learning


An assumption that is almost universal in AI is that the agent has an a priori
domain model which it uses to reason about possible courses of action. These
models are essential to traditional planning approaches since they are required
for search and plan generation. In most cases, the model takes the form of a
set of individual operator models [Fikes and Nilsson, 1971; Laird et al., 1986] or
a set of frame axioms [Hayes, 1973; McCarthy, 1977] that are used to predict
the effects of actions. It is also common to assume that the model is complete,
accurate and stationary. Unfortunately for real-world tasks, models that satisfy
these assumptions are hard to come by for the following reasons:

1.  The task domain may not be well enough understood to formulate an
accurate a priori model. In some cases it may be that the task is too complex
or the domain too unconstrained. In other cases, it may be that the task to
be performed cannot be anticipated in advance (e.g., as might be the case
for a general purpose robot that gets trained on site or whose tasks change
over time).

2.  Even when the task domain is well known, it is often difficult to formulate
a model that is accurate and complete. This is especially true of real-world
tasks [Shafer, 1990] and when using symbolic models [McCarthy and Hayes,
1969; Hayes, 1973].

3.  Most classical planning techniques depend on the world being deterministic.
This assumption makes it difficult to apply classical techniques to problems
that are inherently stochastic and makes classical planning unlikely for the
real world, where unexpected events are commonplace.^2

4.  The real world is constantly changing. Tools wear out, parts break, objects
get moved. The real world is non-stationary, and fixed a priori models of
it are going to be inadequate. An agent’s model must be adaptable so that
it can capture these changes. Most symbolic models are far too fragile and
inexpressive to afford incremental adaptation.

Computational  models  of  intelligent  control  must  not  depend  too  heavily  upon 
complete  and  accurate  a  priori  domain  models.  If  control  depends  upon  an  explicit 
model  at  all,  it  must  suffice  for  it  to  be  incomplete  and  inaccurate.  If  domain 
knowledge  is  available,  then  the  agent  should  be  capable  of  exploiting  it,  but  it 
should  not  be  a  prerequisite  for  intelligent  control. 

Reinforcement learning offers an alternative approach to control that does not
depend upon explicit, a priori domain models [Minsky, 1954; Michie and
Chambers, 1968; Barto et al., 1983; Holland et al., 1986; Sutton, 1988; Watkins,
1989]. A reinforcement learning system is any system that through direct
interaction with its environment improves its performance by receiving feedback in
the form of a scalar reward (or penalty) that is commensurate with the
appropriateness of its response. By "improves its performance" we mean that the
agent uses the feedback to adapt its behavior in an effort to maximize some
measure of the reward it receives in the future. Intuitively, a reinforcement learning
system can be viewed as a hedonistic automaton whose sole objective is to
maximize the positive (reward) and minimize the negative (penalty).

^2 Recently, interest in probabilistic reasoning and planning has been on the rise
(e.g., see [Pearl, 1988; Dean and Wellman, 1991]). Unfortunately, the shift towards
probabilistic models seems only to have increased the computational complexity of
classical planning methods.

The principles of reinforcement learning are pertinent to intelligent control
since they lead to systems that do not depend upon a priori knowledge for decision
making. Instead of using an explicit domain model to generate a sequence of
actions (a plan) which is then executed open loop as in classical planning, a
reinforcement learning system maintains an explicit policy function (analogous
to a universal plan [Schoppers, 1989b; Schoppers, 1989a]) that maps situations
directly into actions. In this case, at each point in time decision making reduces
to computing (looking up) the value of the policy function for the current situation.
If the agent’s policy is correct, then its performance is optimal. If it is not, then
the policy is incrementally improved through direct experience with the world.
By relying on the world directly for feedback (reinforcement) these systems avoid
many of the pitfalls associated with model-based control methods.^3
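
To make the notion of a policy function concrete, the following is a minimal
sketch, not the dissertation's implementation, of a tabular policy that is looked
up at each step and improved incrementally by one-step Q-learning (the algorithm
named in the abstract for overt control). The chain task, the function names, and
the parameter values (alpha, gamma, epsilon) are all assumptions introduced here
for illustration.

```python
import random
from collections import defaultdict

ACTIONS = ("left", "right")  # hypothetical action set for the toy task


def greedy_action(q, state, rng):
    """Policy lookup: return an action with the highest estimated value (ties broken at random)."""
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One-step Q-learning: move the estimate toward reward + gamma * best next value."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


def train_chain(n_states=5, episodes=200, epsilon=0.2, seed=0):
    """Assumed toy task: states 0..n_states-1 in a row, reward 1 on reaching the last one."""
    rng = random.Random(seed)
    q = defaultdict(float)  # value estimates start at zero: no a priori model
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Mostly follow the current policy, occasionally explore.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy_action(q, state, rng)
            next_state = min(state + 1, goal) if action == "right" else max(state - 1, 0)
            reward = 1.0 if next_state == goal else 0.0
            q_update(q, state, action, reward, next_state)
            state = next_state
    return q


q = train_chain()
# After learning, decision making is a table lookup per state.
policy = {s: greedy_action(q, s, random.Random(0)) for s in range(4)}
print(policy)
```

In this sketch the only feedback is the scalar reward at the goal; no model of
the chain is consulted when choosing an action.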

Agents based on reinforcement learning confer a number of other advantages
as well. First, reinforcement learning systems are both situated and reactive:
they can respond quickly to unexpected contingencies and opportunities [Agre,
1988]. There is some confusion in the literature about the distinction between
situated and reactive. Situatedness and reactiveness are two distinct properties.
Both are required for intelligent behavior. An agent is situated if its control
decision is based on the immediate situation (as determined by sensor readings
and possibly a limited amount of internal state); an agent is reactive if it
generates actions/behavior at a rate that is commensurate with the dynamics of the
environment in which it is embedded. Reinforcement learning systems are
situated since decision making is usually based on the immediate situation. They
are reactive since decision making consists of evaluating a policy function, which
typically requires a small constant amount of time. Second, since reinforcement
learning systems incrementally adapt their policies based on experience
accumulated over time, they are effective for control tasks that are stochastic and
(under appropriate conditions) non-stationary. Also, reinforcement learning
systems can exploit domain knowledge when it is available. This can be achieved
by 1) using a priori knowledge to determine a good initial policy [Franklin,
1988], 2) using a domain model to perform hypothetical experiments instead
of relying solely on trial-and-error experiments in the world [Whitehead, 1989;
Sutton, 1990a; Lin, 1990], and 3) using a model to generalize the results of
experiments for better credit assignments [Yee et al., 1990]. Also, because
learning is incremental, these models need not be complete or accurate. This
robustness in the face of incomplete models has led to systems that profit by
using models which themselves have been learned [Sutton, 1990a; Sutton, 1990b;
Lin, 1990].

^3 A fundamental assumption implicit in reinforcement learning is that an agent
over the course of its lifetime is presented with the same set of problems over and
over [Agre, 1985]. The solutions to these problems are encoded directly in the
policy function. Once a solution to a particular problem is learned (encoded in the
policy), future occurrences of the problem can be solved by simply following the
instructions encoded in the policy; no planning or reasoning is required. A new
problem (i.e., one which cannot be reduced to a previously learned problem) or
a change in an existing problem will initially lead to degraded performance because
the policy will not encode a solution. In this case reasoning/problem solving may
be used, or the system can learn the task by trial-and-error experimentation in the
environment.

The idea of using rewards and penalties as feedback for adaptive systems
dates back at least to the fifties and Marvin Minsky [Minsky, 1954], who
studied automata that learned to solve a series of sequential decision problems
(maze problems) by adjusting their decision rules based upon the receipt of
rewards and penalties. Since that time, reinforcement learning has been studied
widely and from a variety of perspectives. Some highlights include Minsky’s early
maze learning automata [Minsky, 1954]; Samuel’s checker player [Samuel, 1963];
Michie and Chambers’ ‘boxes’ algorithm [Michie and Chambers, 1968]; Holland’s
classifier systems and the bucket brigade algorithm [Holland and Reitman, 1978;
Holland et al., 1986]; Sutton’s Adaptive Heuristic Critic [Sutton, 1984; Barto
et al., 1983], its neural implementation [Anderson, 1986], and, subsequently,
Sutton’s Theory of Temporal Difference Methods [Sutton, 1988]. More recently, the
relationship between reinforcement learning and dynamic programming has been
established [Watkins, 1989; Werbos, 1987] and a mathematical theory of
reinforcement learning is beginning to emerge (e.g., see [Barto et al., 1991]). Other
recent results address issues such as modularity [Booker, 1988; Riolo, 1988;
Mahadevan and Connell, 1991; Singh, 1991; Wixson, 1991], integration of planning,
action, and learning [Whitehead and Ballard, 1989a; Whitehead and Ballard, 1989b;
Whitehead, 1989; Sutton, 1990a; Sutton, 1990b; Sutton, 1991; Lin, 1990], faster
credit assignment [Yee et al., 1990; Lin, 1991; Whitehead, 1991], statistical
foundations for better exploration strategies [Kaelbling, 1990], selective perception
and generalization [Whitehead and Ballard, 1991a; Chapman and Kaelbling, 1991;
Tan, 1991a], and neural implementations [Williams, 1987; Schmidhuber, 1990b].


1.3  Principal  Contributions 

This dissertation examines architectures that combine both active sensory systems
(for feasible, task-dependent perception) and reinforcement learning (for adaptive,
non-model-based control). It is shown that incorporating active perception and
reinforcement learning into a single system is non-trivial because of subtle
interactions that prevent the system from learning an adequate control policy. What
makes learning in this context difficult is that, in addition to learning the overt
actions needed to solve a problem, the agent must also discover how to control its
sensory system (e.g., focus its attention) in order to represent accurately the state
of the world with respect to the task. If the agent selectively attends to the few
key objects relevant to the task, then its internal state accurately represents the
world. If, however, the agent does not attend to those key objects, the internal
state may say nothing useful about the world. A dilemma arises: in order for the
agent to learn to solve a task, it must accurately represent the world with respect
to the task; but, in order for the agent to learn an accurate representation, it must
in some sense know how to solve the task.

The  difficulty  arises  when  the  sensory  system,  due  to  improper  control,  gener¬ 
ates  an  internal  state  that  represents  two  or  more  functionally  different  situations 
in  the  external  world.  These  internal  states  are  said  to  be  inconsistent  and  this 
undesirable  overloading  of  internal  states  is  called  perceptual  aliasing.  Perceptual 
aliasing  is  shown  to  interfere  severely  with  most  existing  reinforcement  learning 
algorithms  by  making  it  impossible,  in  certain  situations,  to  estimate  accurately 
the  utility  of  performing  an  action.  Perceptual  aliasing  is  a  fundamental  obstacle 
to  adaptive  intelligent  control  since  it  not  only  arises  in  active  perception  but  is 
also  inherent  in  virtually  every  abstraction  and  generalization  mechanism  —  that 
is,  perceptual  aliasing  can  occur  any  time  it  is  possible  to  ignore  information  that 
is  relevant  to  decision  making  and  utility  estimation. 
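
The effect can be illustrated with a deliberately tiny, hypothetical example (the
world states, actions, and reward table below are assumptions, not taken from the
dissertation). When two world states that call for different actions are mapped
to one internal state, the utility estimated for each action is an average over
both situations, so neither action looks better than the other.

```python
from collections import defaultdict

# Hypothetical world: reward for each (world_state, action) pair.
# In w1 action "a" is correct; in w2 action "b" is correct.
REWARD = {
    ("w1", "a"): 1.0, ("w1", "b"): 0.0,
    ("w2", "a"): 0.0, ("w2", "b"): 1.0,
}


def estimate_utilities(perceive):
    """Average reward per (internal_state, action), with every world state visited equally often."""
    totals, counts = defaultdict(float), defaultdict(int)
    for (world, action), r in REWARD.items():
        key = (perceive(world), action)  # the agent only sees its internal state
        totals[key] += r
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Aliased perception: both world states produce the same internal state "x".
aliased = estimate_utilities(lambda world: "x")
# Consistent perception: each world state has its own internal state.
consistent = estimate_utilities(lambda world: world)

print(aliased)
print(consistent)
```

With the aliased mapping both actions receive the same estimate, so the agent
cannot recover the correct choice for either world state; with the consistent
mapping the estimates match the reward table exactly.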

To surmount the problems caused by perceptual aliasing, an adaptive control
technique, called the Consistent Representation (CR) method, is developed. In
the CR-method control is accomplished in two distinct phases: a state
identification phase, followed by an overt control phase. During state
identification, the system executes sensory-control actions in an attempt to
generate an internal representation that accurately identifies the state of the
external world with respect to the task. Next, during overt control, this internal
representation is used to generate the overt actions needed to perform the task.
Both the state identification and the overt control stages are adaptive. Learning in
the overt control stage is based primarily on standard reinforcement learning
techniques; that is, an overt control policy is adjusted to maximize expected future
rewards. Learning for state identification is somewhat different. The objective of
state identification is to generate internal states that are somehow adequate. In the
CR-method, adequacy is defined in terms of whether or not an internal state is
consistent. By detecting internal states that are inconsistent, the identification
procedure used by the sensory controller can be adapted and inconsistent states
can be eliminated. When this is achieved the sensory controller has learned to
generate a task-dependent internal representation. In the CR-method, perceptual
and overt control are learned incrementally and simultaneously. This follows since
the overt control policy cannot be completely learned until some of a consistent
internal representation is well established, while perceptual control cannot be
accomplished until the overt controller learns, at least partially, how to solve the
task. This interaction results in systems that first learn to represent and solve
easy instances of a task (e.g., the last few steps of a task), and then, through
a bootstrapping process, learn to solve more and more difficult instances. The
CR-method is demonstrated on a robot that learns a simple block manipulation
task that requires the control of an active sensory-motor system.
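
The two-stage organization can be sketched as follows. This is only an
illustrative toy under stated assumptions: the two-block world, the sensing
actions look_A and look_B, the overt actions push and pull, and in particular the
rule used to flag an internal state as inconsistent (its reward for a fixed overt
action differs across world states) are hypothetical stand-ins, and the batch
computation replaces the incremental learning described above.

```python
# Hypothetical world state: (colour of block A, colour of block B).
# The task reward depends only on block A, so looking at B aliases world states.
WORLDS = [("red", "red"), ("green", "red"), ("red", "green"), ("green", "green")]
SENSING_ACTIONS = ["look_B", "look_A"]
OVERT_ACTIONS = ["push", "pull"]


def sense(world, sensing_action):
    """Internal state produced by attending to one block."""
    a, b = world
    return ("A", a) if sensing_action == "look_A" else ("B", b)


def reward(world, overt_action):
    """Assumed task: push when A is red, pull when A is green."""
    return 1.0 if (world[0] == "red") == (overt_action == "push") else 0.0


def find_inconsistent():
    """Flag internal states whose reward for a fixed overt action varies across worlds."""
    seen, inconsistent = {}, set()
    for world in WORLDS:
        for sensing_action in SENSING_ACTIONS:
            internal = sense(world, sensing_action)
            for overt_action in OVERT_ACTIONS:
                r = reward(world, overt_action)
                if seen.setdefault((internal, overt_action), r) != r:
                    inconsistent.add(internal)
    return inconsistent


def control_step(world, inconsistent, overt_policy):
    """Stage 1: sense until the internal state is not flagged. Stage 2: look up the overt action."""
    for sensing_action in SENSING_ACTIONS:
        internal = sense(world, sensing_action)
        if internal not in inconsistent:
            return overt_policy[internal]
    raise RuntimeError("no consistent internal state found")


inconsistent = find_inconsistent()
overt_policy = {("A", "red"): "push", ("A", "green"): "pull"}
results = [reward(w, control_step(w, inconsistent, overt_policy)) for w in WORLDS]
print(inconsistent, results)
```

Here the internal states produced by look_B are suppressed, so the overt policy
only ever needs entries for the consistent states produced by look_A.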

Reinforcement  learning  systems  are  plagued  by  unstructured  initial  search. 
In  particular,  when  rewards  and  punishments  are  rare,  an  agent  may  execute 
a  long  sequence  of  actions  before  it  receives  feedback  essential  for  learning.  If 
the  agent  knows  little  or  nothing  about  the  task  a  priori,  then  during  the  initial 
phases  of  learning  lack  of  feedback  can  lead  to  long  random  walks  in  search  of 
rewards.  As  the  size  and  complexity  of  the  state  space  is  scaled  these  random 
walks  quickly  become  prohibitive.  We  define  cooperative  mechanisms  that  help 
reduce  search  by  providing  the  agent  with  shorter  latency  feedback  and  auxiliary 
sources  of  experience.  The  principal  motivation  for  cooperative  mechanisms  is 
that,  in  nature,  intelligent  agents  do  not  exist  in  isolation,  but  are  embedded  in  a 
benevolent  society  that  guides  and  structures  learning.  Humans  learn  by  watching 
others,  by  being  told,  and  by  receiving  criticism  and  encouragement.  Learning 
more  often  involves  knowledge  transfer  than  discovery.  Similarly,  intelligent  robots 
cannot  be  expected  to  learn  complex  tasks  in  isolation  by  trial-and-error  alone. 
Instead,  they  must  be  embedded  in  cooperative  environments,  and  algorithms 
must  be  developed  to  facilitate  the  transfer  of  knowledge  among  them. 

In this work, two cooperative learning techniques are proposed and
demonstrated. The first, called Learning with an External Critic (LEC), is based on
the idea of a mentor, who watches the agent and generates immediate rewards
in response to the agent’s most recent actions. This reward is used to bias
temporarily the agent’s control strategy. The second algorithm, called Learning By
Watching (LBW), is based on the idea that an agent can gain valuable experiences
vicariously by relating the observed experiences of others to its own. These two
algorithms are demonstrated in the block stacking domain and shown to improve
the learning rate substantially. Also, the search time complexity for these
algorithms, along with a popular reinforcement learning algorithm, is analyzed for
a restricted (but representative) set of learning tasks. The results indicate that
under certain circumstances a popular algorithm (Q-learning) can be expected to
require time at least exponential in the size of the state space, while the LEC and
LBW algorithms require time at most linear in the size of the state space, and
under appropriate conditions may only require time proportional to the length of
the optimal solution path. While these analytic results apply only to a restricted
class of tasks, they shed light on the complexity of search in reinforcement learning
in general and the value of these cooperative mechanisms for reducing search.
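
The gap between exponential and linear search can be illustrated on a
hypothetical chain task; this is an assumption made for the example, not the task
class analyzed in the dissertation. From state k one action moves to k + 1 and
the other resets to state 0, and the only reward is at state n. Before any reward
has been seen, an uninformed learner chooses between the two actions at random,
whereas an ideally guided learner (an assumed best case for cooperative
mechanisms) is steered forward at every step.

```python
from fractions import Fraction


def expected_random_walk_steps(n):
    """Expected steps for uniformly random action choice to reach state n from state 0.

    Assumed task: from state k, one action moves to k + 1 and the other resets to 0.
    Writing E_k = a_k + b_k * E_0 and using E_k = 1 + (E_{k+1} + E_0) / 2 with E_n = 0,
    the coefficients are propagated backward from k = n to k = 0.
    """
    a, b = Fraction(0), Fraction(0)  # coefficients of E_n = 0
    for _ in range(n):
        a, b = 1 + a / 2, (b + 1) / 2
    return a / (1 - b)  # solve E_0 = a_0 + b_0 * E_0


def guided_steps(n):
    """Steps needed when every step is steered toward the goal (assumed ideal guidance)."""
    return n


for n in (5, 10, 20):
    print(n, expected_random_walk_steps(n), guided_steps(n))
```

For this assumed chain the expected length of the random search works out to
2^(n+1) - 2 steps, which grows exponentially with n, while the guided agent
needs only n steps, the length of the solution path.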


1.4  Thesis  Outline 

The technical results described in this dissertation were obtained by analyzing
a series of systems developed to solve tasks in a simple simulated blocks world
domain. Even though this domain is quite simple, these systems proved useful for
developing my intuition. For this reason, the dissertation chronicles the evolution
of these systems, collectively called Meliora. I begin by describing the first system
built, which took a most straightforward approach and failed miserably; then, after
analyzing its failure, I describe a system based on the CR-method, which succeeds
in learning the task but is slow; following that, I demonstrate the potential of
cooperative mechanisms. Grounding the discussion in the blocks world makes
exposition easier and an intuitive understanding more likely. At various points in
the discussion the results are generalized in a more formal theory.

The  remainder  of  the  dissertation  is  organized  as  follows.  Chapters  2  and  3 
review  background  material  from  the  areas  of  active  perception  and  reinforcement 
learning,  respectively.  The  emphases  in  these  chapters  are  on  visual  routines  and 
deictic  sensory-motor  systems  (for  active  vision)  and  Q-learning  (for  reinforcement 
learning) since these are the techniques used by the block stacking systems. Readers
familiar with active perception, Markov decision processes, and Q-learning may
wish to quickly skim Chapters 2 and 3 or skip directly to Chapter 4. Chapter 4
describes the block stacking task. This discussion includes a description of the deictic
sensory-motor system used by all the robots. A formal model for describing agents that
integrate active perception and reinforcement learning is also developed. Chapter
5 describes experiences with the first block stacking agent, analyzes its failure, and
formalizes  the  interactions  caused  by  perceptual  aliasing.  Chapter  6  describes 
a  specific  algorithm  developed  to  overcome  the  effects  of  perceptual  aliasing  for 
the  block  stacking  task.  Generalizing  this  algorithm  leads  to  the  CR-method.  In 
Chapter  7  we  focus  on  cooperative  mechanisms  for  improving  the  learning  rate. 
This  chapter  describes  the  principles  of  LEC  and  LBW,  demonstrates  them  in 
several  block  stacking  systems,  and  presents  a  formal  analysis  of  the  search  time 
complexity of these algorithms. Chapter 8 discusses the limitations of the CR-method
and identifies areas for future research. Conclusions are drawn in Chapter 9.

2 Background: Active Perception


This chapter and the next review essential background material needed to motivate
and understand the work described in this dissertation. This chapter focuses
on  active  perception  and  Chapter  3  focuses  on  reinforcement  learning.  These 
chapters  are  not  intended  to  be  thorough  reviews  of  research  in  active  perception 
or  reinforcement  learning.  Rather,  the  emphasis  is  on  reviewing  specific  ideas  that 
form  the  conceptual  foundation  on  which  our  work  is  based.  With  respect  to  active 
perception, we focus on Ullman’s Visual Routines model of human intermediate
level vision [Ullman, 1984] and on Agre and Chapman’s theory of deictic representations
[Agre and Chapman, 1987; Agre, 1988; Chapman, 1990b]. With respect
to reinforcement learning, we focus on Watkins’ Q-learning algorithm [Watkins,
1989].


2.1  Principles  of  Active  Perception 

The  purpose  of  a  sensory  system  should  be  to  provide  the  agent’s  internal  decision 
processes  with  enough  information  to  make  effective  control  decisions.  Whether 
or  not  a  robot’s  sensors  provide  it  with  adequate  information  depends  upon  the 
task  being  performed  and  the  capabilities  of  the  robot’s  internal  decision  system. 
In  any  case,  an  important  decision  in  the  design  of  a  robot  is  the  choice  of  the 
aspects  of  the  environment  to  sense  at  any  given  point  in  time.  The  most  common 
approach  taken  in  AI  is  to  adopt  a  fixed  sensory  system  that  continually  monitors 
all  potentially  relevant  features  of  the  environment.  In  classical  planning,  this 
approach  is  implicit  in  the  objective  representations  that  at  each  point  in  time 
completely describe the state of the world (e.g., [Fikes et al., 1972; Sacerdoti,
1977]). In behavior-based approaches, it is common to allocate (or assume
the existence of) sensory processes that continually monitor all relevant sensory
attributes (e.g., [Brooks, 1986; Kaelbling, 1987]). While this approach may be
appropriate for relatively simple tasks, where the number of relevant attributes is
small and well known, it fails to scale to more complex tasks.
In particular, as the number, complexity, and diversity of tasks performed by
a  robot  increase,  the  number  of  potentially  relevant  features  needed  for  decision 
making  explodes,  and  it  quickly  becomes  impossible  to  monitor  continually  every 
attribute  that  is  potentially  relevant.  The  problem  is  exacerbated  when  the  task 
to be performed is not well understood ahead of time, since in this case the set of
potentially relevant attributes may be unbounded.

Under these circumstances an active approach to perception is more appropriate.
The principal idea of active perception is to control the allocation of
sensory  processing  resources  in  order  to  analyze  selectively  only  those  aspects  of 
the  environment  that  are  actually  relevant  to  the  immediate  decision  at  hand. 
Following  this  approach,  generality  is  attained  by  using  a  powerful  set  of  sensory 
processing resources that can be flexibly applied to different parts of the environment
and combined to define complex perceptual functions. Efficiency is attained
by  selectively  applying  these  resources  only  as  needed  to  satisfy  the  immediate 
information  requirements  of  the  internal  decision  system. 

With  respect  to  task  performance,  the  idea  is  that  as  the  robot  progresses 
through different stages of a task or proceeds from one task to another, the changing
information needs of the robot are tracked by actively controlling the computations
performed by the sensory system. The degree to which the set of relevant
features  changes  over  time  depends  largely  on  the  range  of  tasks  being  performed, 
their  complexity,  and  the  capabilities  of  the  internal  decision  system  to  maintain 
internal  state  (or  memories).  For  instance,  a  robot  that  performs  two  very  similar 
tasks  may  find  that,  to  a  large  degree,  both  tasks  share  the  same  set  of  relevant 
features,  whereas  a  robot  that  performs  two  dissimilar  tasks  may  find  that  the 
relevant  features  for  the  two  tasks  are  nearly  disjoint.  For  complex  tasks,  changes 
in  relevant  information  may  accompany  transitions  between  stages  of  the  task. 
The  amount  of  context  (or  memory)  maintained  by  a  robot  also  determines  the 
dynamics of the system’s information requirements. A robot that maintains information
about its previous states may only need to verify that its most recent action
had  the  intended  effect,  whereas  a  memoryless  robot  may  need  to  use  its  sensors 
to  reestablish  context  at  each  point  in  time. 

With respect to learning, active perception provides a flexible means for sampling
and monitoring a tremendous range of potentially relevant features (needed
during  learning),  while  it  also  provides  for  the  efficient  generation  of  task-dependent 
internal  representations  once  the  relevant  sensory  information  has  been  discovered. 

The  key  assumption  exploited  by  the  active  perception  paradigm  is  that  at  any 
point  in  time  the  number  of  features  actually  relevant  to  an  agent’s  immediate 
decision  is  relatively  small,  even  though  the  set  of  features  that  are  potentially 
relevant  (or  relevant  at  some  other  point  in  time)  may  be  large.  This  assumption 
appears to become universally valid as the number, complexity, and diversity of
tasks to be learned and performed by an agent increase.

2.2 Visual Routines


In  1984,  Shimon  Ullman  [Ullman,  1984]  proposed  an  abstract  computational 
model  of  human  intermediate  level  vision.  The  model  was  developed  to  explain 
how the perception of spatial properties and relationships that are complex from a
computational standpoint nevertheless often appears deceptively immediate and effortless
for humans. The distinguishing feature of Ullman’s model is that complex
spatial  analysis  is  performed  by  a  set  of  sequential  processes  called  visual  routines. 
With  the  visual  routines  model,  Ullman  made  several  important  contributions  to 
the  study  of  visual  perception: 

1.  He  argued  for  active,  top-down  control  and  the  selective  application  of  visual 
processing resources, a significant departure from many of his contemporaries,
as the prevailing dogma emphasized bottom-up scene reconstruction.

2. He recognized that many abstract spatial properties and relations have a
certain  amount  of  essential  sequentiality  and  are  often  best  described  (and 
perceived)  using  sequential  algorithms. 

3.  He  argued  that  if  properly  chosen,  a  fixed  set  of  basic  visual  operations 
could  be  assembled  to  extract  an  unbounded  variety  of  shape  properties 
and  spatial  relations,  and  that  sharing  these  operations  could  yield  visual 
processing  systems  that  were  both  efficient  and  tremendously  versatile. 

4. He proposed a plausible set of basic operations and demonstrated their utility
on a number of difficult spatial reasoning tasks.

2.2.1  Model  Overview 

In  the  visual  routines  model,  the  computation  of  spatial  relations  is  divided  into 
two  stages.  The  first  stage  involves  the  bottom-up  creation  of  low  level  base 
representations. The second stage involves the application of visual routines to the
base  representations  to  extract  useful  spatial  properties. 

The  base  representations  are  derived  strictly  bottom  up.  They  are  assumed  to 
be  spatially  uniform,  viewer  centered,  and  unarticulated.  Examples  of  plausible 
base representations include the primal sketch and the 2½-D sketch as described by
Marr [Marr, 1976; Marr and Nishihara, 1978]. Local information, such as depth,
color,  edge  orientation,  curvature,  motion,  and  texture,  is  represented  in  the  base 
representations. 

The  second  stage  of  processing  involves  the  application  of  visual  routines  to 
the base representations. The particular visual routine applied in a given situation
is task-dependent and determined by the information needs of the higher
level (decision making) components of the system. Indeed, Ullman suggests that
visual processing be viewed as a query-answering process. Following this view,
higher level decision making components pose queries to the visual system. These
queries get translated into visual routines which are applied to the base representations.
The results of this processing are then made available to the higher
level  components  via  a  central  representation,  a  functional  analog  to  the  internal 
representations  of  classical  planning. 

The processing stages of the visual routine model are depicted in Figure 2.1.

Figure 2.1: The processing stages of the visual routines model. The base representation
is generated by low-level, local visual processes; visual routines are
applied to the base representation, as determined by top-down feedback from
higher level components. The results of visual routine processing are placed in a
central representation for use by higher level decision-making components. Intermediate
results may also be stored in an incremental representation and used for
later processing.

2.2.2 Operations


Visual routines are sequential programs composed from a fixed set of basic operations.
These operations are assumed to be implemented in hardware and shared
between routines. Ullman has suggested a set of possible operations, based on
their potential usefulness, and demonstrated their utility on a number of examples.
These operations include: shift of processing focus, indexing to an odd-man-out
location, bounded spreading of activation, boundary tracing, and marking. Following
is a brief description of each of these operations (taken almost verbatim
from [Ullman, 1984] p. 155).

Shift  of  the  processing  focus.  This  is  a  family  of  operations  that  allow 
the  application  of  the  same  basic  operation  to  different  locations  across 
the  base  representation. 

Indexing. This is a shift operation towards special odd-man-out locations.
A location can be indexed if it is sufficiently different from its
surroundings in an indexable property. Indexable properties include
contrast, orientation, color, motion, and perhaps also size, binocular
disparity,  curvature,  and  the  existence  of  terminators,  corners,  and 
intersections. 

Bounded activation. This operation consists of the spreading of activation
over a surface in the base representation, emanating from a given
location  or  contour  and  stopping  at  discontinuity  boundaries.  This  is 
not  a  simple  operation,  since  it  must  cope  with  difficult  problems  of 
noise,  spurious  internal  contours,  and  fragmented  boundaries. 

Boundary tracing. This operation consists of either the tracing of a
single contour or the simultaneous activation of a number of contours.
The operation must be able to cope with the difficulties raised by
the tracing of incomplete boundaries, tracing across intersections and
branching points, and tracing contours defined at different resolution
scales.

Marking. The operation of marking a location means that this location
is remembered, and processing can return to it whenever necessary.
Such operations are useful in the integration of information in the
processing of different parts of a complete scene.


2.2.3  Examples 

The  utility  of  visual  routines  and  the  plausibility  of  the  above  operations  can  be 
demonstrated  by  considering  a  number  of  visual  tasks.  Three  tasks  examined  by 
Ullman are shown in Figures 2.2-2.4. Figure 2.2 shows several examples of the
inside-outside task. In this case, the objective is to determine visually if the X
is contained by a closed curve. A visual routine for computing the inside-outside
relation, based on “region coloring,” proceeds as follows:

1.  Move  the  processing  focus  to  the  X  location. 

2.  Begin  a  bounded  activation  at  X  until  no  further  spreading  occurs. 

3.  If  the  activation  fails  to  reach  the  edge  of  the  visual  field,  then  report  that 
X  is  contained  by  a  closed  curve. 
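
The three steps above can be sketched in code. This is only a minimal illustration under stated assumptions, not Ullman's implementation: the base representation is assumed to be a grid of characters in which '#' marks boundary cells, bounded activation is approximated by a four-connected flood fill, and the name is_inside_closed_curve is invented for the sketch.

```python
from collections import deque


def is_inside_closed_curve(grid, x_location):
    """Return True if activation spreading from x_location never reaches the grid edge.

    Assumptions (hypothetical, for illustration): `grid` is a list of
    equal-length strings, '#' marks curve cells, and x_location is a
    (row, col) pair that does not itself lie on the curve.
    """
    rows, cols = len(grid), len(grid[0])

    # Step 1: move the processing focus to the X location.
    frontier = deque([x_location])
    activated = {x_location}

    # Step 2: spread activation until no further spreading occurs.
    while frontier:
        r, c = frontier.popleft()

        # Step 3 (negative case): activation reached the edge of the visual field.
        if r in (0, rows - 1) or c in (0, cols - 1):
            return False

        # Spread to the four neighbours, stopping at boundary cells.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if nxt not in activated and grid[nxt[0]][nxt[1]] != "#":
                activated.add(nxt)
                frontier.append(nxt)

    # Step 3 (positive case): spreading stopped without touching the edge.
    return True
```

Note that the running time of this sketch grows with the area that gets activated, which is one way to see why the routine is sequential rather than instantaneous.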


Another  spatial  reasoning  task  that  is  amenable  to  visual  routines  but  difficult 
for  classical  pattern  recognition  techniques,  is  to  determine  whether  or  not  two 
X's  fall  on  the  same  curve  (see  Figure  2.3).  A  visual  routine  for  establishing  this 
property  follows: 

1.  Move  the  processing  focus  to  an  unmarked  X  and  mark  the  location. 

2.  If  the  X  does  not  lie  on  a  curve,  then  go  to  Step  1. 

3.  Trace  along  the  contour  until  another  X  is  encountered  or  the  curve  has 
been  completely  scanned. 

4. If a second X is encountered along the contour in Step 3, then terminate
and  report  success;  otherwise,  if  unmarked  X's  still  exist,  go  to  Step  1;  else 
terminate  and  report  failure. 
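
The routine above can be sketched as follows. The representation is an assumption made for illustration: curves are given as a set of (row, col) cells, X's as a list of cells, and contour tracing is approximated by a breadth-first walk over eight-connected curve cells (so branching points are followed in all directions). The function name is hypothetical.

```python
from collections import deque

# Eight-connected neighbourhood used to follow a contour (an assumption).
NEIGHBORS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def two_xs_on_same_curve(curve_cells, x_locations):
    """Return True if some pair of X's lies on one connected curve."""
    xs = set(x_locations)
    marked = set()

    for x in x_locations:
        # Step 1: move the processing focus to an unmarked X and mark it.
        if x in marked:
            continue
        marked.add(x)

        # Step 2: if the X does not lie on a curve, try the next X.
        if x not in curve_cells:
            continue

        # Step 3: trace along the contour starting from this X.
        frontier = deque([x])
        traced = {x}
        while frontier:
            r, c = frontier.popleft()
            for dr, dc in NEIGHBORS:
                nxt = (r + dr, c + dc)
                if nxt in curve_cells and nxt not in traced:
                    # Step 4: a second X was met along the contour.
                    if nxt in xs:
                        return True
                    traced.add(nxt)
                    frontier.append(nxt)

    # Step 4: no unmarked X's remain; report failure.
    return False
```

In this sketch the work done before success is proportional to the length of contour traced, which matches the qualitative behaviour the routine is meant to capture.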


Interestingly, when human subjects perform this task, they report their perceptions
as immediate and effortless when in fact reaction time experiments show
that the time needed to perform such tasks increases monotonically (nearly linearly)
with the distance traced along the contour [Ullman, 1984]. This is consistent
with the tracing routine given above and suggests that, although unconscious
of  it,  people  probably  use  a  sequential  algorithm  that  includes  a  contour  tracing 
operation. 

A  third  example  of  a  visual  task  is  to  identify  a  subfigure  in  a  scene  despite  the 
presence  of  confounding  figures  in  close  proximity  to  its  contours  (see  Figure  2.4). 
This task is representative of recognition problems found in natural scenes where
cluttered  backgrounds  are  common.  The  task  can  be  accomplished  by  using  a 
bounded  activation  process  to  segment  the  shape,  followed  by  a  shape  analysis 
procedure. 


Figure  2.2:  Examples  of  the  inside-outside  task  (after  [Ullman,  1984]  p.  100). 
Two  simple  instances  of  the  inside-outside  task  are  shown  in  (a)  and  (b).  A  more 
difficult instance is shown in (c).




Figure  2.4:  The  objective  of  this  task  is  to  identify  the  subfigure  containing  the 
X  despite  the  presence  of  confounding  figures  in  close  proximity  to  its  contours. 
The  key  to  this  task  is  to  segment  the  subfigure  from  the  cluttered  background. 
This  can  be  accomplished  by  shifting  the  processing  focus  to  the  X  and  initiating 
a bounded spreading of activation (after [Ullman, 1984] p. 136).


2.2.4  Indexing 


There  are  two  essential  forms  of  selectivity  in  the  visual  routines  model:  routine 
selection and indexing. Routine selection is the process of constructing and selecting
the visual routines to be applied at a given point in time. Other than to
associate  it  with  higher  level  components  of  the  system,  algorithms  for  routine 
selection  are  not  specified  by  the  model.  Indexing  is  the  process  of  determining 
where  in  the  base  representation  to  apply  a  visual  routine. 

Examination  of  the  above  visual  routines  shows  that  processing  does  not  occur 
in  parallel  over  the  entire  base  representation,  but  tends  to  be  localized  in  specific, 
relevant  regions.  For  instance,  the  algorithm  for  finding  two  X’s  on  a  common 
contour  begins  by  focusing  processing  at  locations  occupied  by  an  X.  These 
locations  in  the  base  representation  act  as  “anchor  points”  for  the  routine  and 
establish context-dependent referents used by the visual routine.

In  the  visual  routines  model,  indexing  is  accomplished  in  three  stages.  In  the 
first  stage,  a  set  of  indexical  properties  are  computed  in  parallel  over  the  entire  base 
representation.  These  indexical  properties  are  assumed  to  be  locally  computable 
features  such  as  motion,  color,  orientation,  and  curvature.  In  the  second  stage, 
an  odd-man-out  operation  is  performed  to  detect  locations  that  are  significantly 
different  from  their  surroundings.  This  salience  operation  is  mediated  by  selecting 
specific indexical properties to be emphasized, or satisfied by the chosen anchor
point.  In  the  last  stage,  the  processing  focus  is  shifted  to  the  most  salient  location. 
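
The three indexing stages can be illustrated with a short sketch. Everything concrete here is an assumption made for the example: the scene is a grid of dictionaries of numeric local features, salience is measured as the absolute difference between a cell and the mean of its immediate neighbours, and the function names are hypothetical.

```python
def compute_property_map(scene, prop):
    """Stage 1: compute one selected indexical property at every location."""
    return [[cell[prop] for cell in row] for row in scene]


def odd_man_out(feature_map):
    """Stage 2: score each location by how much it differs from its surroundings."""
    rows, cols = len(feature_map), len(feature_map[0])
    salience = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                feature_map[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if neighbours:
                mean = sum(neighbours) / len(neighbours)
                salience[r][c] = abs(feature_map[r][c] - mean)
    return salience


def shift_focus(scene, prop):
    """Stage 3: shift the processing focus to the most salient location."""
    salience = odd_man_out(compute_property_map(scene, prop))
    locations = [(r, c) for r in range(len(salience)) for c in range(len(salience[0]))]
    return max(locations, key=lambda loc: salience[loc[0]][loc[1]])
```

Choosing which property name to pass as prop plays the role of selecting the indexical property to be emphasized; the sequential loops stand in for what the model assumes is computed in parallel.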

A  cornerstone  of  the  active  vision  paradigm  in  general,  and  the  visual  routines 
model  in  particular,  is  the  idea  of  limiting  the  application  of  visual  processing 
resources  to  only  those  spatial  locations  that  are  functionally  relevant.  From 
a  computational  standpoint,  the  viability  of  this  idea  hinges  on  being  able  to 
quickly  identify  functionally  relevant  locations  in  the  base  representation.  An 
important  assumption  made  by  the  visual  routines  model  is  that  in  most  cases 
locally computable features can be used to quickly identify these functionally relevant
regions. Whether or not this assumption stands up in practice remains
to be seen. However, it appears that in many cases local properties can indeed
be used to differentiate relevant locations from the background. From a psychological
standpoint, considerable evidence exists to suggest that the human visual
system  employs  some  indexing  mechanisms  based  on  locally  computable  features 
[Treismann  and  Gelade,  1980;  Julesz,  1981].  Recent  work  by  Swain,  Ballard, 
and  Wixson  has  also  demonstrated  the  utility  of  locally  computable  properties 
based  on  color  for  indexing  in  an  active  computer  vision  system  [Swain,  1990; 
Wixson, 1990; Wixson and Ballard, 1991].

2.2.5 Summary of the Visual Routines Model

The  human  visual  system  has  an  uncanny  ability  to  perform,  with  surprising  ease, 
spatial analysis tasks that from a computational standpoint are quite sophisticated.
Few computational models are capable of explaining such sophisticated
processes.  Ullman’s  visual  routine  model  represents  a  significant  advance  in  this 
regard.  It  has  also  contributed  to  the  recent  shift  in  computational  vision  away 
from  scene  reconstruction  and  towards  active,  task-oriented  perception. 

The  key  features  of  the  Visual  Routines  Model  follow  ([Ullman,  1984]  p.  108): 

1.  Spatial  properties  and  relationships  are  established  by  the  application  of 
visual  routines  to  a  set  of  early  visual  representations. 

2.  Visual  routines  are  assembled  from  a  fixed  set  of  elemental  operations. 

3.  New  routines  are  assembled  to  meet  newly  specified  processing  goals. 

4.  Different  routines  share  elemental  operations. 

5.  A  routine  can  be  applied  to  different  spatial  locations.  The  processes  that 
perform  the  same  routine  at  different  locations  are  not  independent. 

6. Mechanisms are provided for sequencing elemental operations and for selecting
locations to apply routines.


2.3  Deictic  Representations 

One  area  that  Ullman’s  visual  routines  model  does  not  address  in  detail  is  the 
interface  between  the  visual  routine  processor  and  higher  level  control  components 
of  the  system.  In  particular,  Ullman’s  model  does  not  answer  these  questions: 
What  information  should  be  encoded  in  the  internal  representations  of  the  higher 
level  components?  or  How  do  the  higher  level  components  control  the  construction 
and  application  of  visual  routines? 

Agre  and  Chapman  have  proposed  one  answer  to  the  question  of  the  content 
of  the  central  representation  in  their  theory  of  deictic  representations  [Agre  and 
Chapman,  1987;  Agre,  1988;  Chapman,  1990b].  The  key  observation  exploited  in 
this  theory  is  that  for  any  particular  task  (or  activity)  there  is  usually,  at  any  given 
point  in  time,  a  relatively  small  number  of  objects  that  are  immediately  relevant 
to  the  agent’s  behavior.  Instead  of  attempting  to  represent  each  and  every  object 
in  the  domain  objectively  (as  is  common  in  traditional  representational  schemes), 
deictic  representations  aim  to  monitor  and  represent  actively  only  those  few  key 
objects that are immediately relevant to the ongoing activity. In a deictic representation,
if some aspect of the world is not significant to the agent’s immediate
activity it is ignored. This task-oriented approach to representation has the advantages
that 1) it significantly reduces the computational burden placed on the
sensory  system  and  2)  it  affords  a  kind  of  “passive  abstraction”  [Agre,  1988]  that 
greatly  facilitates  decision-making  by  collapsing  the  infinite  complexity  of  the  real 
world  onto  much  smaller  task-specific  internal  representations. 


2.3.1  Entities  and  Aspects 

In a deictic representation, it is assumed that a task (or activity) can be described
abstractly in terms of continuous interactions with, and manipulations
of, abstract functional objects called indexical functional entities. That is, it is
assumed that knowledge of relevant features and relationships of these few key
entities is sufficient to determine the behavior of an agent. These abstract properties
and relationships are called indexical functional aspects and they comprise
the  agent’s  internal  representation. 

A  simplistic  (but  intuitive)  way  to  think  about  entities  is  to  view  them  as 
functional  roles  that  must  be  instantiated  by  objects  in  the  world  in  the  course 
of  actually  performing  a  particular  task.  That  is,  when  an  agent  actually  engages 
in  an  activity,  entities  get  bound  to  specific  objects  in  the  world,  and  the  actual 
behavior  of  the  agent  (in  a  particular  instantiation  of  an  activity)  is  determined 
by  the  actual  properties  and  relationships  (aspects)  of  these  bound  objects.  At 
any  given  point  in  time,  an  entity  is  associated  with  at  most  one  object  in  the  real 
world.  However,  over  the  course  of  time,  in  various  instantiations  of  an  activity  or 
in  different  stages  of  an  activity,  an  entity  may  bind  many  different  objects.  The 
point  is  that  entities  correspond  to  functional  roles  in  an  activity,  which  may  be 
played  by  a  wide  range  of  objects,  depending  upon  the  particular  circumstances 
in  the  external  world.  In  this  discussion,  we  focus  on  indexical  functional  entities 
that  represent  physical  objects  in  the  external  world.  These  simple  entities  are 
analogous  to  natural  kinds  (or  simple  nouns)  used  in  linguistics.  However,  just 
as there are nouns to describe complex, abstract concepts, so too can complex
indexical  functional  entities  be  defined  to  represent  them. 

Aspects describe important properties of entities. These may include relatively
simple features, such as color, texture, orientation, and position in space,
or arbitrarily complex properties and relationships such as shape, relative size,
inside-outside relationships, overlap, alignment, distance, or just about any condition
imaginable. In addition to determining the overall state of the agent with
respect to a task, aspects are also useful for instantiating specific motor commands.
For instance, the position of an entity relative to the agent’s body may
be used to establish the reference frame used during reaching [Ballard, 1989b].

2.3.2 Implementing Deictic Representations

In  general,  in  a  deictic  representation,  there  are  many  ways  to  establish  bindings 
between  entities  and  aspects  in  the  agent’s  internal  representation  and  specific 
objects  and  properties  in  the  external  physical  world.  For  instance,  the  binding 
between an entity such as the-cup-from-which-I-am-drinking and a physical object
(say  the  actual  cup  I  am  drinking  from)  can  be  established  through  visual  cues 
(e.g.,  visual  fixation),  through  haptic  cues  (e.g.,  grasping  it  with  one’s  hand),  or 
even  through  memory  (e.g.,  by  remembering  the  values  of  aspects/properties  that 
characterize  it). 

It is also important to point out that it is not necessary for the agent to continually
maintain every object-entity binding associated with a specific activity.
Bindings  may  be  dynamic  and  may  depend  upon  the  status  of  the  agent’s  ongoing 
activity  and  upon  conditions  in  the  world.  In  general,  what  is  required  between 
an  object  and  an  entity  is  that  the  agent  maintain  a  causal  relationship  between 
the  two  so  that  when  the  time  comes  to  establish/exploit  the  binding,  it  can  be 
made. Such causal relationships can be maintained through behavioral conventions,
habits, policies, laws, juxtapositions, etc. See Chapter 7 of [Agre, ming] for
a  complete  discussion. 

For  the  purposes  of  this  dissertation,  we  restrict  ourselves  to  simple  cases 
where  entities  are  always  simple  (i.e.,  bound  to  physical  objects  in  the  external 
world)  and  vision  alone  is  used  to  establish  object-entity  bindings.  In  the  visual 
systems described in [Agre, 1988; Chapman, 1990b], an object is bound to an entity
using a mechanism known as a marker. Markers are best thought of as pointers
implemented by the visual system and are quite similar to the “processing foci”
described in Ullman’s model. A marker can be bound to only one object at a time
and  it  is  assumed  that  the  sensory-motor  system  maintains  the  marker’s  binding 
at  all  times.  Changing  a  marker’s  binding  is  accomplished  by  executing  explicit 
actions  specifically  targeted  for  that  marker.  These  actions  index  target  objects 
in  the  world  according  to  specific  indexical  properties  that  distinguish  them  from 
other  objects,  as  in  Ullman’s  model. 

It  is  important  for  an  agent  to  establish  object-entity  bindings  as  quickly  and 
efficiently  as  possible.  To  facilitate  binding,  each  entity  has  associated  with  it  a 
set  of  indexical  properties/routines  that  are  used  to  guide  the  search  for  candidate 
objects.  Once  a  candidate  object  has  been  identified,  additional  aspects  can  be 
computed  using  more  complex  visual  routines  to  determine  whether  or  not  it 
meets  the  functional  constraints  of  the  entity. 
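
The marker mechanism and the two-step binding process described above can be sketched as a small data structure. This is an illustration under assumptions, not the implementation used in Pengi or Sonja: the class names, the entity name, and the representation of objects as property dictionaries are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class WorldObject:
    """A physical object with locally observable properties (hypothetical)."""
    name: str
    properties: Dict[str, object]


@dataclass
class Marker:
    """A visual pointer that binds one entity to at most one object at a time."""
    entity: str
    # Cheap indexical properties used to guide the search for candidates.
    indexical_test: Callable[[WorldObject], bool]
    # More expensive aspects that check the entity's functional constraints.
    functional_test: Callable[[WorldObject], bool]
    bound_to: Optional[WorldObject] = None

    def rebind(self, scene: List[WorldObject]) -> Optional[WorldObject]:
        """Explicit action that changes this marker's binding."""
        self.bound_to = None
        for obj in scene:
            # Index candidates first, then verify them with richer aspects.
            if self.indexical_test(obj) and self.functional_test(obj):
                self.bound_to = obj
                break
        return self.bound_to
```

Because the binding is stored in a single field, the sketch enforces the constraint that a marker refers to only one object at a time; a binding persists until rebind is called again.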

2.3.3  Control 

As  in  Ullman’s  model,  Agre  and  Chapman’s  theory  assumes  that  higher  level 
(decision) components control the selection of visual routines and, consequently,
the generation of the agent’s internal representation. In the systems they implemented,
the higher level decision components were hand-crafted combinational
circuits,  and  other  than  to  emphasize  the  need  for  continuous  interaction  between 
perception  and  control,  no  formal  theory  of  high  level  control  was  provided. 

2.3.4  Instantiations  of  the  Theory 

Agre and Chapman wrote a pair of programs that, among other things, demonstrated
the utility of the visual routines model and of deictic representations
in complete behaving systems [Agre and Chapman, 1987; Agre, 1988; Chapman,
1990b].  The  programs  play  two  different  video  games.  The  first  program,  called 
Pengi,  plays  Pengo;  and  the  second  program,  called  Sonja,  plays  Amazon.  I’ll 
focus  on  Pengi. 

Pengo  is  an  interactive  video  game  in  which  a  penguin  moves  around  in  a 
maze  of  ice  blocks  that  is  also  occupied  by  bees.  The  player  (Pengi)  controls  the 
penguin’s  movements  with  a  joystick  while  the  bees  fly  around  semi-randomly. 
The  penguin  can  be  stung  and  killed  by  bees  that  get  too  close.  Also,  the  penguin 
and  bees  can  modify  the  maze  by  kicking  ice  blocks  to  make  them  slide.  If  a  block 
slides  into  a  bee  or  the  penguin  it  dies.  Figure  2.5  shows  a  snapshot  of  a  Pengo 
game  in  progress.  In  the  figure,  the  penguin  can  be  found  in  the  middle  left  part 
of  the  screen. 

In  Pengi,  entities  and  aspects  are  used  to  identify  ice  cubes  and  bees  (and  other 
things  like  corridors  and  openings)  that  are  relevant  to  the  player’s  behavior. 
At  various  times  during  play,  Pengi’s  internal  representation  monitors  different 
entities  and  aspects,  depending  upon  the  particular  activity  engaging  Pengi.  Some 
of  the  entities  that  Pengi  keeps  track  of  at  various  times  during  the  game  are: 


the-ice-cube-I-am-kicking,

the-bee-I-am-chasing,

the-bee-on-the-other-side-of-this-ice-cube-next-to-me,

the-ice-cube-that-the-ice-cube-I-just-kicked-will-collide-with.


Space  does  not  permit  a  detailed  discussion  of  the  visual  routines  used  to  identify 
these entities or compute their aspects. However, elegant discussions of the theory
of deictic representations and its realization in Pengi and Sonja can be found in
incarnations of Agre’s and Chapman’s doctoral dissertations, [Agre, 1988; Agre,
ming] and [Chapman, 1990b; Chapman and Kaelbling, 1991], respectively.

3  Background: Reinforcement Learning


This chapter reviews some of the principles of reinforcement learning. The main
objective of the chapter is to review Watkins’ Q-learning algorithm; however, we
would also like to establish some intuition into the workings of Q-learning and relate
it to the mathematical foundations on which it is based. Therefore, we begin by
reviewing the Theory of Markov Decision Processes and the fundamentals of Dynamic
Programming. We then show how Q-learning can be derived as a kind of
incremental, Monte Carlo version of the policy iteration algorithm of dynamic
programming. More thorough treatments of the Theory of Markov Decision Processes
and Dynamic Programming can be found in numerous good books (e.g., [Bellman,
1957; Ross, 1983; Bertsekas, 1987]), and an excellent account of Q-learning can be
found in Watkins’ PhD dissertation [Watkins, 1989].


3.1  Markov  Decision  Processes 

We  are  interested  in  agents  that  learn  to  perform  tasks  that  require  interactions 
with  the  world  over  an  extended  period  of  time.  These  tasks  require  the  agent  to 
use  sensors  to  differentiate  states  of  the  world  and  to  decide,  depending  upon  the 
situation,  which  action  to  take.  Sometimes  actions  will  achieve  a  desired  end  (or  a 
goal)  but  most  of  the  time  actions  will  be  used  to  preserve  some  desired  property 
of  the  world  or  to  set  up  opportunities  to  achieve  goals  in  the  future. 

Markov decision processes provide a useful mathematical framework to describe
these sequential decision tasks. In a Markov decision process (also called
a controllable Markov chain) a control task is formally modeled by the tuple
$(S, A, T, R)$. In this model, $S$ is the set of possible states the world can occupy, $A$
is the set of possible actions the agent (or controller) can take, $T$ is a transition
function that determines the effects of actions, and $R$ is a reward/cost function
that associates payoffs and penalties with actions and states. In this dissertation
we will assume that the set of possible world states, $S$, and the set of available
actions, $A$, are discrete and finite.

In a Markov decision process, time advances in unit duration quanta ($t =
0, 1, 2, 3, 4, \ldots$), where at each point in time the system occupies exactly one state.*
At each point in time, the agent selects and executes one action, which causes the
world to make a transition to a new state and results in the receipt of a reward
(or penalty). At any given time $t$,

$X_t$ is the random variable denoting the state of the system,

$x_t$ is the actual state of the system,

$R_t$ is the random variable denoting the reward received (i.e.,
as a result of executing an action in state $x_t$),

$r_t$ is the actual reward received,

$a_t$ is the action executed by the controller.


The effects of an action depend only upon the state in which it was performed.
This dependence is modeled by the transition function $T$, such that $T(x_t, a_t) =
X_{t+1}$. For deterministic models, the mapping from state-action pairs to next
states is fixed and the transition function is a true function. However, in general
transitions are allowed to be probabilistic. In this case, the result returned by $T$
is a sample drawn from a probability distribution over $S$. In a Markov decision
process the probability distributions that govern the transition function depend
only upon the action and the state in which it was performed, and the transition
function is fully specified by a set of transition probabilities $P_{xy}(a)$, where

$$P_{xy}(a) = \Pr(T(x,a) = y). \tag{3.1}$$

Similarly, the reward function $R$ determines the reward received as a result of
executing an action. It, too, is a probabilistic function of the current state-action
pair; thus, $R_t = R(x_t, a_t)$. While a set of probability distributions governing
the reward function could be defined for each state-action pair, analogous to
the transition function, we are usually only concerned with the expected reward
obtained as a result of executing an action. Thus, we will assume instead that the
Markov Decision Process defines an expected reward with each state-action pair,
which we write as

$$\rho(x,a) = E[R(x,a)]. \tag{3.2}$$

This  completes  the  formal  definition  of  a  Markov  Decision  Process. 
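To make the definition concrete, the tuple $(S, A, T, R)$ can be written down for a small problem. The sketch below is only an illustration under our own assumptions: the three-state chain, the class name `MDP`, and the fields `P`, `rho`, and `step` are hypothetical and do not come from the systems described in this dissertation.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list   # S
    actions: list  # A
    P: dict        # P[(x, a)] -> {y: probability}, the P_xy(a) of Equation 3.1
    rho: dict      # rho[(x, a)] -> expected reward, Equation 3.2

    def step(self, x, a, rng=random):
        """Sample T(x, a): draw the next state from the distribution P_xy(a)."""
        successors = list(self.P[(x, a)])
        weights = [self.P[(x, a)][y] for y in successors]
        return rng.choices(successors, weights=weights)[0]


# Hypothetical chain 0 - 1 - 2 in which state 2 is an absorbing goal.
chain = MDP(
    states=[0, 1, 2],
    actions=["left", "right"],
    P={
        (0, "left"): {0: 1.0},
        (0, "right"): {1: 0.9, 0: 0.1},  # "right" slips 10% of the time
        (1, "left"): {0: 1.0},
        (1, "right"): {2: 0.9, 1: 0.1},
        (2, "left"): {2: 1.0},
        (2, "right"): {2: 1.0},
    },
    rho={
        (0, "left"): 0.0, (0, "right"): 0.0,
        (1, "left"): 0.0, (1, "right"): 0.9,  # reward 1 on entering the goal, prob. 0.9
        (2, "left"): 0.0, (2, "right"): 0.0,
    },
)
```

Storing the transition probabilities per state-action pair mirrors Equation 3.1 directly; a deterministic model is the special case in which every distribution puts all of its mass on one successor.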

3.1.1  The  Markov  Property 

A key property of the Markov Decision Process model is that transitions and
rewards depend only upon the current state and the current action. That is not
to say that knowledge of the current state and current action is sufficient to predict
the reward and next state (transitions and rewards may be non-deterministic),
but that any information above and beyond knowledge of the current state
and current action is useless for making predictions. This is called the Markov
property. It is a very strong condition because it says that knowledge of the
current state-action pair captures the essence of the task and that any additional
knowledge of the world whatsoever has no impact on one’s ability to predict the
outcome of executing the state-action pair.

* Actually, continuous time and continuous state Markov decision processes can be defined.
However, our discussion will restrict itself to discrete time and discrete state processes.

It  is  important  to  point  out  that  the  Markov  property  is  not  an  intrinsic 
property  of  any  physical  process;  rather  it  is  a  property  of  a  mathematical  model 
of  a  physical  process.  Virtually  any  physical  process  can  be  modeled  as  a  Markov 
process  if  the  state-space  and  control  actions  are  chosen  appropriately. 

The Markov property is crucial when considering the effects of active perception
on the decision problems faced by an agent’s embedded control system. In
particular, if the state space is defined by the possible values of the agent’s sensory
inputs and if the sensory system does not provide adequate information about the
state of the world, then the decision problem facing the agent’s embedded controller
may not satisfy the Markov property. As we will see in Chapter 5, this
can be catastrophic since 1) it may lead to decision problems that have no fixed
optimal policy and 2) it may cause reinforcement learning algorithms to oscillate
from one policy to another, never converging on an optimal strategy.

3.1.2  The  Objective  of  Control 


Policies 

Markov  decision  processes  are  also  called  Controllable  Markov  Processes  because 
one  essential  component  of  these  models  is  a  controller  that,  by  choosing  actions, 
can  influence  (and  in  some  cases  directly  control)  the  state  space  trajectory  of  the 
world through time. One way to specify the behavior of the controller is with a
policy. A policy is a rule that determines the action to execute given the current
state. Formally, a policy $f$ is a function from states to actions ($f : S \to A$), where
$f(x)$ denotes the action to be executed in state $x$.

A  policy  is  stationary  if  the  mapping  from  states  to  actions  is  fixed.  If  the 
function  is  probabilistic  then  the  policy  is  said  to  be  stochastic,  otherwise  it  is 
deterministic.  Given  a  Markov  decision  process,  the  objective  is  to  find  (either 
by  direct  computation  or  by  learning)  a  deterministic,  stationary  policy  that  is  in 
some  sense  optimal. 

Let us denote the probability that the process makes a transition from state
$x$ to state $y$, given that the agent is following policy $f$, as

$$P_{xy}(f). \tag{3.3}$$

Similarly, let us use

$$T(x,f), \quad R(x,f), \quad \text{and} \quad \rho(x,f) \tag{3.4}$$

to denote the random variables for the next state, the reward and the expected
reward, respectively, when following policy $f$ in state $x$.

Returns 

In the Theory of Markov decision processes, optimality is defined in terms of cumulative
reward received over time. That is, the goal of the agent is to implement
a  control  policy  that  on  average  maximizes  some  measure  of  the  total  reward  to 
be  received  in  the  future. 

There are several possible measures of cumulative reward. One of the most
common is a measure based on a discounted sum of the total reward received over
time. This sum is called the return and for time $t$ is defined as

$$\mathrm{return}_t = \sum_{n=0}^{\infty} \gamma^n r_{t+n}, \tag{3.5}$$

where $\gamma$, called the discount factor, is a constant between 0 and 1. In this measure,
the discount factor is used to weight the importance of a reward as a function of
its distance into the future. Setting $\gamma = 0$ leads to shortsighted policies that are
interested only in the most immediate reward. Setting the discount factor near
one leads to returns that weigh rewards some time into the future and results
in optimal policies that may forgo immediate rewards in order to set up larger
rewards down the road. For any $\gamma < 1$ the effects of rewards far into the future
eventually become negligible and the return is finite. Because a process may be
stochastic, the objective is to find a policy that maximizes the expected return.
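The effect of the discount factor is easy to check numerically. The following is a minimal sketch, assuming a finite reward trace in place of the infinite sum in the definition of the return; the function name `discounted_return` and the example trace are our own illustrative choices.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**n * r_{t+n} over a finite reward trace (a truncated return)."""
    total = 0.0
    # Work backwards so that each step needs only one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A reward of 1 arrives two steps in the future.
trace = [0.0, 0.0, 1.0]
far_sighted = discounted_return(trace, gamma=0.9)   # weights the delayed reward by 0.9**2
short_sighted = discounted_return(trace, gamma=0.0)  # sees only the immediate reward
```

With a discount factor of zero the delayed reward contributes nothing, which is the shortsighted behavior described above.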

Optimality  Criterion 

For a fixed policy $f$, define $V_f(x)$ to be the expected return, given that the process
begins in state $x$ and follows policy $f$ thereafter.

To define $V_f$ more precisely we must introduce more notation. Let $X(x,f,n)$
be a random variable denoting the state that results from starting in state $x$ and
following policy $f$ for $n$ steps. Note that $X(x,f,0) = x$ and $X(x,f,1) = T(x,f)$.
Similarly, let $R(x,f,n)$ be the random variable denoting the reward received at
time $t + n$ after starting in state $x$ and following policy $f$ for $n + 1$ steps. So,
for example, $R(x,f,0) = R(x,f)$. Also, we will sometimes use $X(x,a,1)$ as
alternative notation for $T(x,a)$. Given these definitions, $V_f$ is defined as

$$V_f(x) = E[R(x,f,0) + \gamma R(x,f,1) + \gamma^2 R(x,f,2) + \cdots + \gamma^n R(x,f,n) + \cdots]. \tag{3.6}$$

$V_f$ is called the value function for policy $f$ and it can be used to define an
optimality criterion. Namely, given a description of a Markov decision process,
the objective is to find a policy whose value function is uniformly maximal for
every state. That is, find a policy $f^*$ such that

$$V_{f^*}(x) = \max_f V_f(x) \quad \forall x \in S. \tag{3.7}$$

Intuitively,  an  optimal  policy  is  any  control  strategy  that  always  maximizes 
the  expected  return,  no  matter  what  state  the  process  finds  itself  in  and  no  matter 
what  the  history  of  the  process  may  be. 

3.1.3  Example 

To  illustrate  how  sequential  tasks  can  be  formulated  as  Markov  decision  processes, 
consider  the  task  shown  in  Figure  3.1.  In  this  task,  we  want  the  robot  to  navigate 
from an initial state S, through the maze, to the goal state G. The robot can take
unit length steps in each of the four principal compass directions and cannot pass
through  barriers.  For  simplicity  we  will  assume  that  actions  are  deterministic. 

A  formulation  of  this  task  as  a  Markov  decision  process  is  shown  in  Figure  3.2. 
In  this  formulation  the  robot  receives  a  positive  reward  only  when  it  executes  the 
action  that  causes  it  to  enter  the  goal  state.  In  general  the  reward  function  must 
be  carefully  chosen  if  the  optimal  policy  associated  with  the  decision  process  is  to 
match  the  desired  behavior.  One  particularly  convenient  approach  is  to  identify  a 
set  of  desirable  goal  states  and  to  reward  the  agent  whenever  it  encounters  one  of 
these  states.  Following  this  approach,  the  optimal  policy  usually  corresponds  to 
taking  actions  that  achieve  the  goal  in  the  least  number  of  steps.  Of  course  for  a 
given  task,  or  behavioral  interaction,  there  are  an  infinite  number  of  formulations 
whose  optimal  policies  will  yield  the  desired  behavior. 

The optimal policy and value function for the formulation given in Figure 3.2
are shown in Figure 3.3. The optimal policy is indicated with an arrow in each
square. The values for $V_{f^*}$ are shown in the lower right corner of each tile. These
values are for $\gamma = 0.9$. Notice that the policy is defined over the entire state
space  and  that  following  the  policy  from  any  location  traces  out  a  shortest  path 
to  the  goal  state.  Also  notice  that  the  value  function  monotonically  increases  as 
the  robot  approaches  the  goal  state.  Indeed,  for  this  formulation  of  the  task,  the 
optimal  trajectory  follows  the  gradient  in  the  value  function. 


Figure 3.2: A formalization of the navigation task as a Markov decision process.
Shown is the state space $S$, the set of possible actions $A$, the reward function $R$,
and the transition function $T$. In this model, there are 19 states, 4 actions, and
the agent receives a reward when it first enters the goal state.

Figure 3.3: The optimal policy and optimal value function for the navigation task.
The direction of the arrow indicates an optimal action to take when occupying
a given tile. Values for the optimal value function are shown in the lower right
corner of each tile. Notice that the value function decreases with the
distance to the goal and that the optimal policy follows the gradient in the value
function.


3.2  Dynamic Programming

At this point Markov decision processes have been defined and shown to be useful
for formalizing sequential control tasks. However, a crucial question has not been
addressed: How do we find an optimal policy? If all of the parameters of the
Markov decision process are known precisely, then Dynamic Programming can
be used to compute an optimal policy. This section reviews essential ideas and
techniques of dynamic programming.


3.2.1  The  Brute  Force  Method 

One  way  to  obtain  the  value  function  for  a  given  policy  is  to  estimate  it  by 
Monte Carlo simulation. That is, $V_f(x)$ can be estimated accurately by repeatedly
starting the process off in state $x$, letting it follow policy $f$ for some (possibly
long) period of time, and accumulating a return for each trace. Since each trace
is independent, the law of large numbers guarantees that an average of the trace
returns will eventually converge to the expected value ($V_f(x)$).
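The Monte Carlo estimate just described can be sketched in a few lines. The example assumes a hypothetical three-state chain under a fixed policy (so `P[x]` plays the role of $P_{xy}(f)$) and a reward of 1 on first entering the goal; the names `mc_value`, `reward`, and `GOAL`, and the trace counts, are illustrative assumptions rather than anything prescribed by the text.

```python
import random

GOAL = 2
# P[x] -> {y: probability} under the fixed policy "always move right" (hypothetical chain).
P = {
    0: {1: 0.9, 0: 0.1},
    1: {2: 0.9, 1: 0.1},
    2: {2: 1.0},
}


def reward(x, y):
    """Assumed reward scheme: 1 on first entering the goal, 0 otherwise."""
    return 1.0 if (y == GOAL and x != GOAL) else 0.0


def mc_value(x0, gamma=0.9, n_traces=5000, horizon=200, seed=0):
    """Average the discounted returns of n_traces independent traces from x0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_traces):
        x, ret, discount = x0, 0.0, 1.0
        for _ in range(horizon):
            if x == GOAL:
                break  # absorbing, zero-reward state: nothing more to accumulate
            dist = P[x]
            y = rng.choices(list(dist), weights=list(dist.values()))[0]
            ret += discount * reward(x, y)
            discount *= gamma
            x = y
        total += ret
    return total / n_traces
```

For this chain the exact value of state 1 is $0.9 / 0.91$, so the estimate can be compared against a known answer; the comparison also shows how many traces are spent to obtain a single entry of the value function.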

A  better  way  to  compute  the  value  function  is  to  compute  it  directly  by  noticing 
that it can be defined recursively. In particular, the definition of $V_f(x)$

$$V_f(x) = E[R(x,f,0) + \gamma R(x,f,1) + \gamma^2 R(x,f,2) + \cdots + \gamma^n R(x,f,n) + \cdots] \tag{3.8}$$

can be written recursively as

$$V_f(x) = E[R(x,f,0)] + \gamma E[V_f(X(x,f,1))] \tag{3.9}$$

or equivalently as

$$V_f(x) = \rho(x,f) + \gamma E[V_f(X(x,f,1))]. \tag{3.10}$$

For finite state processes, the expectation on the right hand side can be written
as a sum over a finite number of states, leading to the expression

$$V_f(x) = \rho(x,f) + \gamma \sum_{y \in S} P_{xy}(f) V_f(y). \tag{3.11}$$

Equation 3.11 defines a family of linear equations which can be solved for $V_f$ when
combined with the constraint expressions

$$\sum_{y \in S} P_{xy}(f) = 1 \tag{3.12}$$

and the values for $\rho(x,a)$ and $P_{xy}(a)$ provided by the model.

Thus,  given  a  finite  state,  finite  action  Markov  decision  process  and  any  fixed 
policy,  the  value  function  for  that  policy  can  be  computed  directly;  although 
solving  the  linear  equations  may  be  time  consuming. 
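As an illustration of the direct computation, the linear equations can be rearranged as $(I - \gamma P_f) V_f = \rho_f$ and handed to any linear solver. The sketch below uses a small Gaussian elimination routine so that it needs no external libraries; the helper names `solve_linear` and `evaluate_policy` and the three-state example are our own assumptions.

```python
def solve_linear(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    v = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * v[j] for j in range(i + 1, n))
        v[i] = (M[i][n] - s) / M[i][i]
    return v


def evaluate_policy(P_f, rho_f, gamma):
    """Solve V = rho_f + gamma * P_f V, i.e. (I - gamma * P_f) V = rho_f."""
    n = len(rho_f)
    A = [[(1.0 if i == j else 0.0) - gamma * P_f[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, rho_f)


# Hypothetical chain under the policy "always move right"; state 2 is absorbing.
P_f = [[0.1, 0.9, 0.0],
       [0.0, 0.1, 0.9],
       [0.0, 0.0, 1.0]]
rho_f = [0.0, 0.9, 0.0]
V = evaluate_policy(P_f, rho_f, gamma=0.9)
```

Elimination costs on the order of $|S|^3$ operations, which is the sense in which solving the equations may be time consuming for large state spaces.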

Given  this  means  for  computing  a  policy’s  value  function,  one  way  to  find  an 
optimal  policy  is  to  enumerate  all  possible  policies,  compute  the  value  function  for 
each,  and  choose  one  that  is  uniformly  maximal.  Unfortunately,  this  exhaustive 
approach  is  computationally  intractable  even  for  small  problems  since  the  number 
of  possible  policies  grows  exponentially  in  the  state  space  size  and  the  number  of 
possible actions.

3.2.2  Policy Iteration

Dynamic  programming  offers  a  much  cheaper  (but  generally  still  expensive)  method 
for solving Markov decision problems. In dynamic programming, instead of comparing
all possible policies in search of an optimal one, an arbitrary policy is
initially selected and then incrementally improved until it is optimal.

Suppose we have two policies $f$ and $g$, and we would like to know if $g$ is
uniformly better than $f$. One way to answer this question is to compute $V_f$ and
$V_g$ as above and compare them directly. A less expensive method, which only
requires the full computation of $V_f$, follows.

Suppose that in any initial state $x$, instead of always following policy $f$, we first
follow policy $g$ for one step and then switch back to $f$, which we follow thereafter.
Consider the expected return that results from such a strategy. In particular, let
$Q_f(x,b)$ denote the expected return, given that starting in state $x$ the process
executes action $b$ and then follows policy $f$ thereafter. $Q_f$ is called the action-value
function for policy $f$. The action-value function can be expressed in terms
of $V_f$ as follows:

$$Q_f(x,b) = \rho(x,b) + \gamma \sum_{y \in S} P_{xy}(b) V_f(y). \tag{3.13}$$

Given Equation 3.13, $Q_f(x,g(x))$ is much simpler to compute (given that we know
$V_f$) than $V_g$.

Given the action-value function, we can determine if $g$ is better than $f$. In
particular, suppose that the expected return obtained by following $g$ for one step
and then switching back to $f$ is uniformly as good as or better than the expected
return obtained by following $f$ only. That is, suppose that for all $x \in S$,

$$Q_f(x,g(x)) \ge V_f(x). \tag{3.14}$$

Then by induction $g$ is uniformly better than $f$. This can be seen by considering
the expected return that results from following $g$ for $n$ steps and then switching
back to $f$. Denote the resulting expected return as $Q_f(x,g,n)$. For $n = 1$,

$$Q_f(x,g,1) = Q_f(x,g(x)) = \rho(x,g(x)) + \gamma \sum_{y \in S} P_{xy}(g(x)) V_f(y). \tag{3.15}$$

It is given that $Q_f(x,g(x)) \ge V_f(x)$, so for $n = 1$ we have

$$Q_f(x,g,1) \ge V_f(x) \quad \text{for all } x. \tag{3.16}$$

For the inductive step, assume $Q_f(x,g,n) \ge V_f(x)$ for all $x$. Now in general we
can write

$$Q_f(x,g,n+1) = \rho(x,g(x)) + \gamma \sum_{y \in S} P_{xy}(g(x)) Q_f(y,g,n). \tag{3.17}$$

But since $Q_f(y,g,n) \ge V_f(y)$ for all $y$, it follows that

$$Q_f(x,g,n+1) \ge \rho(x,g(x)) + \gamma \sum_{y \in S} P_{xy}(g(x)) V_f(y). \tag{3.18}$$

The right hand side of this equation is just $Q_f(x,g,1)$, which by assumption is
greater than or equal to $V_f(x)$. So we have

$$Q_f(x,g,n+1) \ge V_f(x) \quad \text{for all } x \text{ and } n. \tag{3.19}$$

Of course, $V_g(x)$ is the limit as $n$ goes to infinity of $Q_f(x,g,n)$, so $V_g(x) \ge V_f(x)$
for all $x$. This result is summarized by the policy improvement theorem [Bellman,
1957; Bertsekas, 1987; Watkins, 1989].

Theorem  1  (Policy  Improvement  Theorem) 

Let $f$ and $g$ be policies, and let $g$ be chosen so that

$$Q_f(x,g(x)) \ge V_f(x) \quad \text{for all } x \in S. \tag{3.20}$$

Then it follows that $g$ is uniformly better than $f$. That is,

$$V_g(x) \ge V_f(x) \quad \text{for all } x \in S. \tag{3.21}$$

The policy improvement theorem is a cornerstone of dynamic programming
because it provides a relatively efficient means of finding better and better policies.
Beginning with a policy $f$, a better policy $f'$ can be found by 1) computing $V_f$,
2) calculating the action-value function $Q_f(x,a)$ for all state-action pairs $(x,a) \in
S \times A$, and 3) defining $f'$ at each state by choosing the action that maximizes the
action-value function. That is, for all $x$,

$$f'(x) := a \ \text{such that}\ Q_f(x,a) = \max_{b \in A} Q_f(x,b). \tag{3.22}$$

By the policy improvement theorem $f'$ is guaranteed to be uniformly as good as
or better than $f$.
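A single improvement step can be sketched as follows. The example assumes a hypothetical three-state chain and starts from the value function of the poor policy "always move left", which never collects reward; the names `action_values` and `improve` are illustrative.

```python
GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
# Hypothetical chain: P[(x, a)] -> {y: probability}; state 2 is an absorbing goal.
P = {
    (0, "left"): {0: 1.0}, (0, "right"): {1: 0.9, 0: 0.1},
    (1, "left"): {0: 1.0}, (1, "right"): {2: 0.9, 1: 0.1},
    (2, "left"): {2: 1.0}, (2, "right"): {2: 1.0},
}
RHO = {(x, a): 0.0 for x in STATES for a in ACTIONS}
RHO[(1, "right")] = 0.9  # expected reward for entering the goal


def action_values(V):
    """Q_f(x, a) = rho(x, a) + gamma * sum_y P_xy(a) V_f(y)."""
    return {(x, a): RHO[(x, a)] + GAMMA * sum(p * V[y] for y, p in P[(x, a)].items())
            for x in STATES for a in ACTIONS}


def improve(V):
    """Greedy policy: at each state pick an action that maximizes Q_f(x, a)."""
    Q = action_values(V)
    return {x: max(ACTIONS, key=lambda a: Q[(x, a)]) for x in STATES}


# "Always move left" never reaches the goal, so its value function is zero everywhere.
V_left = {0: 0.0, 1: 0.0, 2: 0.0}
better = improve(V_left)
```

After one step only state 1 changes its action, because it is the only state from which reward is visible through $V_f$; further rounds of evaluation and improvement are needed before state 0 learns to move right.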

This process can be repeated over and over again until, after a finite number
of iterations, $f$ no longer changes. At this point, the final policy $f^*$ will satisfy
the optimality equation

$$f^*(x) = a \ \text{such that}\ Q_{f^*}(x,a) = \max_{b \in A} Q_{f^*}(x,b). \tag{3.23}$$

Even though we know each iteration improves $f$, is it possible that the terminal
policy $f^*$ is non-optimal? The Optimality Theorem tells us that indeed $f^*$ is
optimal. The following version of the Optimality Theorem is taken from [Watkins,
1989]; the proof of a similar version can be found in [Bellman, 1957].

$f$ := an arbitrary initial policy
Repeat

    1. calculate the value function $V_f$,

    2. calculate the action-values $Q_f(x,a)$ for all $x, a$,

    3. update the policy for all $x$ by choosing $f(x) = a$ such that
       $Q_f(x,a) = \max_{b \in A} Q_f(x,b)$

until there is no change in $f$ at step 3.


Figure 3.4: The Policy Iteration Algorithm.

Theorem  2  (Optimality  Theorem) 

Let a policy $f^*$ have an associated value function $V^*$ and an action-value function
$Q^*$. If the policy $f^*$ cannot be further improved by using the policy improvement
theorem, that is if

$$V^*(x) = \max_{b \in A} Q^*(x,b) \tag{3.24}$$

and

$$f^*(x) = a \ \text{such that}\ Q^*(x,a) = V^*(x) \tag{3.25}$$

for all $x \in S$, then $V^*$ and $Q^*$ are the unique, uniformly optimal value and action-value
functions, respectively, and $f^*$ is an optimal policy. The optimal policy $f^*$ is
unique unless there are states at which there are several actions with the maximal
action-value, in which case any policy that recommends actions with the maximal
action value according to $Q^*$ is optimal.

A  summary  of  the  policy  iteration  algorithm  is  shown  in  Figure  3.4.  This 
algorithm  is  guaranteed,  via  the  policy  improvement  and  optimality  theorems,  to 
converge  on  an  optimal  policy  for  any  finite  Markov  decision  problem. 
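The loop of Figure 3.4 can be sketched as follows for a hypothetical three-state chain. Two choices here are ours and not part of the algorithm as stated: step 1 evaluates $V_f$ by repeated sweeps of the recursive value equation instead of solving the linear system directly (for brevity), and ties are broken in favor of the current action so that the loop terminates.

```python
GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
# Hypothetical chain: P[(x, a)] -> {y: probability}; state 2 is an absorbing goal.
P = {
    (0, "left"): {0: 1.0}, (0, "right"): {1: 0.9, 0: 0.1},
    (1, "left"): {0: 1.0}, (1, "right"): {2: 0.9, 1: 0.1},
    (2, "left"): {2: 1.0}, (2, "right"): {2: 1.0},
}
RHO = {(x, a): 0.0 for x in STATES for a in ACTIONS}
RHO[(1, "right")] = 0.9


def q_value(V, x, a):
    """Action value of a in x, given the value function V of the current policy."""
    return RHO[(x, a)] + GAMMA * sum(p * V[y] for y, p in P[(x, a)].items())


def evaluate(f, sweeps=500):
    """Approximate V_f by repeated sweeps (a stand-in for the linear solve)."""
    V = {x: 0.0 for x in STATES}
    for _ in range(sweeps):
        V = {x: q_value(V, x, f[x]) for x in STATES}
    return V


def policy_iteration(f):
    while True:
        V = evaluate(f)                                  # step 1
        new_f = {}
        for x in STATES:                                 # steps 2 and 3
            best = max(ACTIONS, key=lambda a: q_value(V, x, a))
            # Keep the current action on ties so that the policy can stabilize.
            keep = q_value(V, x, f[x]) >= q_value(V, x, best)
            new_f[x] = f[x] if keep else best
        if new_f == f:
            return f, V
        f = new_f


f_star, V_star = policy_iteration({x: "left" for x in STATES})
```

Starting from "always move left", this run needs two improvement rounds: the first switches state 1 to "right", and only then does state 0 see enough value downstream to switch as well.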

3.2.3  Value  Iteration 

The policy iteration algorithm is an improvement over exhaustive search; however,
it  is  still  expensive.  Note  that  the  value  function  must  be  recalculated  in  each 
stage.  Even  though  the  value  function  for  the  new  policy  may  be  very  similar  to 
the  old  one,  it  cannot  easily  be  derived  from  it,  but  instead  must  be  computed  by 
solving  the  set  of  linear  equations  given  by  Equations  3.11  and  3.12. 


Value iteration is another computational technique for solving Markov decision
problems that is often more efficient than policy iteration. In value iteration,
instead of repeatedly computing the value functions for a series of ever improving
non-optimal policies, the optimal policy is obtained by directly solving the optimality
equation (Equation 3.23) for a series of finite-horizon tasks. As the horizon
goes to infinity, the optimal policies of the finite horizon problems converge to the
optimal policy for the original Markov decision problem.

In a finite horizon problem, the objective is to find a policy that uniformly maximizes
the expectation of a return that is truncated at the horizon boundary. For
example,  when  the  horizon  equals  one,  n  =  1,  the  goal  is  to  find  a  policy  that 
maximizes  the  expected  reward  received  after  one  step.  For  n  =  2,  we  want  to 
maximize  the  cumulative  reward  received  after  two  steps,  and  so  on. 

Let $V^n$ denote the optimal value function for an $n$-step finite horizon problem.
Since rewards are only received after executing an action, we have $V^0(x) = 0$
for all $x$. $V^1(x)$ can be determined by choosing the action that has the highest
expected reward, namely

$$V^1(x) = \max_{a \in A} \rho(x,a). \tag{3.26}$$

Similarly, $V^n(x)$ can be expressed recursively as the sum of the expected reward
received after one step plus the discounted expected return for an $n - 1$ horizon
problem that follows. That is,

$$V^n(x) = \max_{a \in A} \left[ \rho(x,a) + \gamma E[V^{n-1}(T(x,a))] \right], \tag{3.27}$$

where

$$E[V^{n-1}(T(x,a))] = \sum_{y \in S} P_{xy}(a) V^{n-1}(y). \tag{3.28}$$

The optimal policy for a finite horizon problem is obtained by choosing the
action that achieves the maximum in Equation 3.27. If $f^n$ denotes the optimal
policy for an $n$-step finite horizon problem, then

$$f^n(x) = \arg\max_{a \in A} \left[ \rho(x,a) + \gamma E[V^{n-1}(T(x,a))] \right]. \tag{3.29}$$

By starting at $n = 0$, $V^n$ can be computed by repeatedly applying Equation 3.27.
For finite rewards and $\gamma < 1$, as $n \to \infty$, $|V^n - V^*| \to 0$. Similarly,
since $V^n$ converges to $V^*$, $f^n$ converges to $f^*$ (see Ross for further details [Ross,
1983]).

The  value  iteration  algorithm  is  summarized  in  Figure  3.5.  Notice  that  each 
stage  of  the  process  is  computationally  equivalent  to  calculating  the  action-value 
function  in  policy  iteration;  however,  in  value-iteration  the  value  function  for  the 
next  stage  is  determined  automatically  and  we  don’t  have  to  compute  it  by  solving 
a set of linear equations.


$V^0(x) = 0$ for all $x$,
$Q^0(x,a) = 0$ for all $(x,a) \in S \times A$,
$n = 0$,

Repeat
    $n = n + 1$,
    for each $x \in S$ do:
        for each $a \in A$ do:
            $Q^n(x,a) = \rho(x,a) + \gamma \sum_{y \in S} P_{xy}(a) V^{n-1}(y)$,
        $V^n(x) = \max_{a \in A} Q^n(x,a)$
        $f^n(x) = \arg\max_{a \in A} Q^n(x,a)$
until the difference between $V^n$ and $V^{n-1}$ is sufficiently small for all $x$.


Figure  3.5:  The  Value  Iteration  Algorithm. 
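The value iteration loop translates almost line for line into code. The sketch below assumes a hypothetical three-state chain and a stopping tolerance of our own choosing (`tol`); neither comes from the text.

```python
GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
# Hypothetical chain: P[(x, a)] -> {y: probability}; state 2 is an absorbing goal.
P = {
    (0, "left"): {0: 1.0}, (0, "right"): {1: 0.9, 0: 0.1},
    (1, "left"): {0: 1.0}, (1, "right"): {2: 0.9, 1: 0.1},
    (2, "left"): {2: 1.0}, (2, "right"): {2: 1.0},
}
RHO = {(x, a): 0.0 for x in STATES for a in ACTIONS}
RHO[(1, "right")] = 0.9


def value_iteration(tol=1e-10):
    V = {x: 0.0 for x in STATES}  # the zero-horizon value function
    while True:
        # One stage: action values for horizon n from the values for horizon n - 1.
        Q = {(x, a): RHO[(x, a)] + GAMMA * sum(p * V[y] for y, p in P[(x, a)].items())
             for x in STATES for a in ACTIONS}
        new_V = {x: max(Q[(x, a)] for a in ACTIONS) for x in STATES}
        f = {x: max(ACTIONS, key=lambda a: Q[(x, a)]) for x in STATES}
        if max(abs(new_V[x] - V[x]) for x in STATES) < tol:
            return new_V, f
        V = new_V


V_star, f_star = value_iteration()
```

Each pass through the loop costs one sweep over all state-action pairs and involves no linear solve, which is the saving over policy iteration noted above.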

3.3  Q-Learning 

There is a close relationship between reinforcement learning and dynamic programming.
In both, the world is characterized by a set of states, a set of actions,
and a reward function. In both, the objective is to find a decision policy that maximizes
the expected cumulative reward received over time. There is an important
difference though. When solving a Markov decision problem with dynamic programming,
the analyst (presumably the designer of the eventual control system)
has a complete (albeit stochastic) model of the environment’s behavior. Given
this information, the analyst directly computes the optimal policy. In reinforcement
learning, the set of states and the set of possible actions are known a priori,
but the effects of actions and the production of reward are initially unknown.
Thus, an analyst cannot compute an optimal policy a priori. Instead, an optimal
control strategy must be learned through experimentation in the environment.
Dynamic programming is an offline analytical tool; reinforcement learning is an
online adaptive control technique.

The  work  in  this  dissertation  is  based  on  a  reinforcement  learning  algorithm 
called Q-learning developed by Chris Watkins [Watkins, 1989]. We focus on Q-learning
because of its simplicity and because of its close relationship with dynamic
programming and the Theory of Markov decision processes. Q-learning is also
significant because one version of it, namely 1-step Q-learning, when applied to
Markov decision processes under appropriate conditions, can be shown to converge
to  an  optimal  policy.  Although  there  are  several  varieties  of  reinforcement  learning 
algorithms,  most  are  based  on  ideas  from  dynamic  programming  and  are  quite 
similar  to  Q-learning  [Samuel,  1963;  Michie  and  Chambers,  1968;  Sutton,  1984; 
Holland,  1975;  Kaelbling,  1990]. 

The connection between Q-learning and dynamic programming is strong. Q-learning
can be thought of as a form of incremental dynamic programming, where
the optimal action-value function is computed incrementally based on the system’s
interactions  with  its  environment.  Q-learning  can  be  derived  as  a  modification  of 
either  policy  iteration  or  value  iteration.  The  following  section  derives  Q-learning 
as  a  form  of  incremental  policy  iteration. 

3.3.1  Q-Learning  as  Incremental  Policy  Iteration 

Recall that in policy iteration, an optimal policy is obtained upon convergence
of a sequence of policy improvement stages. During each stage, the action-value
function, $Q_f$, for the current policy $f$ is computed and used to derive an improved
policy for the next stage (see Figure 3.4). The improved policy, $f'$, is chosen by
selecting for each state the action that locally maximizes the expected return.
That is,

\[ \forall x \in S:\quad f'(x) = a \ \text{such that}\ Q_f(x,a) = \max_{b \in A} Q_f(x,b). \qquad (3.30) \]

The new policy (via the policy improvement theorem) is guaranteed to be uniformly
as good as or better than the current policy.
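
The improvement step of Equation 3.30 can be illustrated with a short sketch. The table layout (a dictionary of per-state action values), the function name `improve_policy`, and the numbers below are assumptions made for illustration; they do not come from the experiments in this dissertation.

```python
def improve_policy(q):
    """Return the greedy policy f'(x) = argmax_b Q_f(x, b) for every state x.

    `q` maps each state to a dict of {action: action-value}.
    Ties are resolved in favor of the action listed first.
    """
    return {x: max(actions, key=actions.get) for x, actions in q.items()}


# Hypothetical action values Q_f for a two-state, two-action problem.
q_f = {
    "s0": {"left": 0.2, "right": 0.9},
    "s1": {"left": 0.5, "right": 0.1},
}
improved = improve_policy(q_f)
```

Each state is treated independently, which is why the improvement is described as selecting the locally maximizing action.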

Computing the action-value function is a crucial step in policy iteration. Classically,
$Q_f$ is determined by first solving for $V_f$ (using the linear equations defined
in Equations 3.11 and 3.12) and then using Equation 3.13 to compute its values.
Unfortunately, both of these steps require prior knowledge of the statistics
that govern the process (namely $\rho(x,a)$ and $P_{xy}(a)$), which for learning tasks are
initially unknown.

If  a  learning  algorithm  based  on  policy  iteration  is  to  be  developed,  it  will  be 
necessary  to  find  a  way  to  compute  the  action-value  function.  One  way  to  proceed 
is  to  estimate  the  needed  statistics  directly.  Following  this  approach  one  can  use 
observations of interactions with the world to develop estimates of $\{\rho(x,a)\}$ and
$\{P_{xy}(a)\}$. These estimates can then be used to compute estimates of the
action-value functions and eventually compute an estimate of the optimal policy. If
the  world  is  stationary,  and  if  each  state-action  pair  is  tried  often  enough,  then 
accurate  estimates  can  be  obtained  and  eventually  the  optimal  policy  learned. 

The  trouble  with  this  approach  is  that  it  is  not  particularly  incremental.  It  is 
too  expensive  to  recompute  the  policy  after  each  step.  Indeed,  computing  a  policy 
just once can be very expensive, especially for large problems. If this method
were to be used, it would be necessary to update the policy periodically, where
the  recomputation  rate  would  depend  on  the  time  required  to  do  it.  During  the 
interval  between  recomputations  the  agent’s  policy  would  be  frozen  and  it  would 
not  be  able  to  take  advantage  of  the  experience  it  had  gained  since  the  last  policy 
computation.  Also,  this  method  would  be  inappropriate  if  the  agent  had  to  act 
continuously  in  the  world.  An  agent  may  not  be  able  to  afford  to  take  time  out 
to  recompute  its  policy. 

A better approach that is computationally less expensive and more incremental
is to appeal directly to the definitions of $V_f$ and $Q_f$ and estimate them directly via
Monte Carlo simulation. One way to estimate $V_f(x)$ directly is simply to average
the returns observed over multiple trials (or sequences) that begin in state $x$ and
follow $f$. A similar technique can be used to estimate $Q_f$.


Estimating $V_f$ and $Q_f$

If the system is in state $x$ at time $t$, then an estimate of $V_f(x)$ can be obtained
by observing the rewards obtained by following $f$. The total discounted reward
received in such a sequence is called the actual return for time $t$ and is given by:

\[ \mathbf{r}_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^n r_{t+n} + \cdots. \qquad (3.31) \]

After $n$ steps an estimate of the actual return can be obtained by considering
rewards received so far. This is called the n-step truncated return,

\[ r_t^{[n]} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-1} r_{t+n-1}. \qquad (3.32) \]

If rewards are bounded and $\gamma < 1$, then $\lim_{n \to \infty} (\mathbf{r}_t - r_t^{[n]}) = 0$. So for a sufficiently
large $n$ an accurate estimate of the actual return can be obtained.

An accurate estimate of $V_f(x)$ can then be obtained by averaging a large
number of n-step truncated returns. An estimate of $V_f$ can then be obtained by
performing enough of these experiments for each state-action pair.
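
The averaging procedure can be sketched as follows. The two-state chain, its reward scheme, and all names below are hypothetical; the chain stands in for an environment observed while a fixed policy $f$ is followed, and the sketch only illustrates averaging n-step truncated returns.

```python
import random

GAMMA = 0.9


def step(state, rng):
    """Hypothetical two-state chain under a fixed policy f.

    A reward of 1 is received on leaving state 'b', 0 on leaving 'a';
    the next state is 'a' or 'b' with equal probability.
    """
    reward = 1.0 if state == "b" else 0.0
    next_state = "b" if rng.random() < 0.5 else "a"
    return reward, next_state


def truncated_return(rewards, gamma):
    """Discounted sum of the rewards actually observed (n = len(rewards))."""
    return sum(gamma**k * r for k, r in enumerate(rewards))


def estimate_value(start, n, trials, rng):
    """Average n-step truncated returns over many trials starting in `start`."""
    total = 0.0
    for _ in range(trials):
        state, rewards = start, []
        for _ in range(n):
            r, state = step(state, rng)
            rewards.append(r)
        total += truncated_return(rewards, GAMMA)
    return total / trials


rng = random.Random(0)
v_a = estimate_value("a", n=100, trials=5000, rng=rng)
v_b = estimate_value("b", n=100, trials=5000, rng=rng)
```

For this particular chain the exact values can be worked out by hand as $V_f(a) = 4.5$ and $V_f(b) = 5.5$, so the two averages should land close to those numbers. Note how many trials and how long a horizon are needed even for so small a problem.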

The  advantage  of  estimating  the  value  function  directly  using  Monte  Carlo 
methods,  instead  of  computing  it  by  solving  a  set  of  linear  equations,  is  that 
the  estimate  can  be  updated  incrementally  as  new  experiences  provide  additional 
information.  However,  there  are  still  a  number  of  problems  with  this  approach: 

1.  The  value  of  the  truncated  return  is  only  available  after  n  steps.  If  n  is 
large,  as  it  should  be  to  obtain  an  accurate  estimate,  this  delay  could  be 
significant.  Also,  if  truncated  returns  for  multiple  state-action  pairs  are  to 
be  obtained  simultaneously,  then  a  possibly  long  sequence  of  state,  action, 
and reward triples must be remembered.

2. For stochastic systems, the truncated returns obtained may have a large
variance, and many experiments will be required to estimate the expected
value accurately.

3. In order to obtain the truncated return, the agent must follow $f$ for $n$ steps
before  it  can  try  a  non-policy  action.  Thus,  non-optimal  actions  can  only 
be  tried  every  n  steps,  and  return  estimates  for  non-optimal  actions  are 
expensive  to  obtain. 

These  problems  can  be  overcome  by  using  a  different  kind  of  estimator.  Instead 
of  accumulating  actual  rewards  for  a  long  series  of  steps,  we  can  define  an  estimator 
that  has  a  component  for  the  rewards  that  have  actually  been  received  (up  to  n 
steps)  and  a  component  that  estimates  the  reward  to  be  received  in  the  future 
(after $n$ steps). That is, suppose the agent performs an experiment as before.
Starting in state $x$ it executes $f$ for $n$ steps. An estimate of the actual return
received can be obtained by augmenting the n-step truncated return with a term
that accounts for the reward that would be received after time $t + n$. If the state
obtained after $n$ steps is $x_{t+n}$, then the expected value of the reward to be received
after $n$ steps, assuming the system follows $f$, is just $V_f(x_{t+n})$. Of course, $V_f$ is
not known ahead of time. It is what we are trying to estimate. However, an
approximation of $V_f$ can be obtained from our current estimate. Let $U_t$ denote
the approximation to $V_f$ at time $t$.

The corrected n-step truncated return is then defined as

\[ r_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n U_t(x_{t+n}). \qquad (3.33) \]

If $U$ equals $V_f$, then $r_t^{(n)}$ is an unbiased estimator of $V_f$. That is,

\[ E[r_t^{(n)}(x)] = V_f(x). \qquad (3.34) \]

Even when $U_t$ is not an accurate estimate of $V_f$, the corrected n-step estimate
is still useful for estimating the action-value function. This is due to what Watkins
calls the error-reduction property of corrected truncated returns [Watkins, 1989].
In particular, let $M$ be the maximum absolute error in $U_t$ with respect to $V_f$, that
is,

\[ M = \max_{x \in S} |U_t(x) - V_f(x)|. \qquad (3.35) \]

Then clearly,

\[ \max_{x \in S} |E[r_t^{(n)}(x)] - V_f(x)| \le \gamma^n M. \qquad (3.36) \]

The significance of this result is that the average of the corrected truncated returns
for any given state will never have an error greater than $\gamma^n M$. Also, $M$ will tend
to zero as experiments with the maximally erroneous state accumulate. Thus,
for a fixed policy $f$, the value function can be estimated accurately by averaging
corrected n-step truncated returns.
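
The error-reduction property can be checked numerically on a small example. The sketch below assumes a hypothetical deterministic three-state cycle, so that expectations reduce to single values, and a deliberately wrong estimate `u` of the value function; none of these numbers come from the experiments reported here.

```python
GAMMA = 0.9
NEXT = {0: 1, 1: 2, 2: 0}            # deterministic successor under the policy f
REWARD = {0: 0.0, 1: 0.0, 2: 1.0}    # reward received on leaving each state


def true_values(sweeps=2000):
    """Compute V_f by repeatedly applying V(x) = r(x) + gamma * V(next(x))."""
    v = {x: 0.0 for x in NEXT}
    for _ in range(sweeps):
        v = {x: REWARD[x] + GAMMA * v[NEXT[x]] for x in NEXT}
    return v


def corrected_return(x, n, u):
    """n discounted rewards plus gamma^n * U at the state reached after n steps."""
    total = 0.0
    for k in range(n):
        total += GAMMA**k * REWARD[x]
        x = NEXT[x]
    return total + GAMMA**n * u[x]


v = true_values()
u = {0: v[0] + 0.5, 1: v[1] - 0.3, 2: v[2] + 0.1}   # erroneous estimate of V_f
m = max(abs(u[x] - v[x]) for x in v)                # maximum absolute error M

# Error of the corrected n-step return, for each state and several values of n.
errors = {(x, n): abs(corrected_return(x, n, u) - v[x])
          for x in NEXT for n in range(1, 11)}
```

In this deterministic setting every entry of `errors` is bounded by $\gamma^n M$, so the error of the corrected return shrinks geometrically with $n$ even though `u` itself is never repaired.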

There are two key advantages to corrected truncated returns. First, they are
effective even when $n$ is small. The error-reduction property holds for all $n > 0$.
This means that 1) estimates for the action-value function can be obtained without
excessive delay, 2) estimates for non-policy actions can be obtained more easily
since the agent need not follow its policy for a long sequence of actions in order
to obtain useful estimates, and 3) variation in the estimate due to the sum of a
large number of random variables can be reduced. In general, there is a tradeoff
between the variance in an estimate and the effect of bias in an estimate due to
error in $U_t$. For large $n$, the variance in a single estimate will be large compared
to that for small $n$. Conversely, the impact of errors in $U_t$ decreases exponentially
with increasing $n$.

A second advantage of corrected truncated returns is that a weighted sum of
corrected truncated returns for different values of $n$ can be implemented efficiently
and locally in time. In particular, the (n+1)-step estimator can be expressed in
terms of the n-step estimator as follows:

\[ r_t^{(n+1)} = r_t^{(n)} + \gamma^n [r_{t+n} + \gamma U_t(x_{t+n+1}) - U_t(x_{t+n})]. \qquad (3.37) \]

The bracketed term on the right side of the equation can be computed locally in
time since it depends only upon the observables $r_{t+n}$, $x_{t+n+1}$, and $x_{t+n}$. Using this
relationship, a series (or weighted sum) of corrected estimators can be efficiently
computed over time and used to update estimates of $V_f$ and $Q_f$ incrementally. The
bracketed term on the right is the difference between two estimates of $V_f(x_{t+n})$.
The term $U_{t+n}(x_{t+n})$ is an estimate available at time $t + n$. The term
$r_{t+n} + \gamma U_{t+n+1}(x_{t+n+1})$ is an estimate obtained by waiting for one step and observing
the next state and rewards that follow. The term $[r_{t+n} + \gamma U_t(x_{t+n+1}) - U_t(x_{t+n})]$,
called the temporal difference error, estimates the error in $U_{t+n}(x_{t+n})$ based on
information gained after one time step. Techniques based on these temporal differences
(TD) were first developed and formally analyzed by Sutton, who showed
them to be useful for a wide range of prediction and estimation tasks [Sutton, 1984;
Sutton, 1988; Sutton and Pinette, 1985].
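
The recursive relationship between successive corrected estimators can be sketched in a few lines. The trajectory, rewards, and value estimate `u` below are made-up sample data, `u` is held fixed over the trajectory, and the 0-step estimator is taken to be $U(x_t)$ by convention; these are assumptions of the sketch rather than details of the original derivation.

```python
GAMMA = 0.9


def corrected_return(rewards, states, u, n):
    """Direct form: n discounted rewards plus gamma^n * U(x_{t+n})."""
    head = sum(GAMMA**k * rewards[k] for k in range(n))
    return head + GAMMA**n * u[states[n]]


def td_error(rewards, states, u, n):
    """Temporal difference error observed at step t + n."""
    return rewards[n] + GAMMA * u[states[n + 1]] - u[states[n]]


def corrected_returns_incremental(rewards, states, u, max_n):
    """Build the 0..max_n step estimators by adding discounted TD errors."""
    estimates = [u[states[0]]]                    # 0-step estimator: U(x_t)
    for n in range(max_n):
        estimates.append(estimates[-1] + GAMMA**n * td_error(rewards, states, u, n))
    return estimates


# Hypothetical observed trajectory x_t, ..., x_{t+4} and rewards r_t, ..., r_{t+3}.
states = ["a", "b", "c", "a", "b"]
rewards = [0.0, 1.0, 0.5, 2.0]
u = {"a": 0.3, "b": -0.2, "c": 1.1}

incremental = corrected_returns_incremental(rewards, states, u, max_n=4)
direct = [corrected_return(rewards, states, u, n) for n in range(5)]
```

The two lists agree up to rounding, which is the sense in which a whole family of corrected estimators can be maintained using only quantities that are local in time.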


Incremental  Policy  Improvement 

Equipped  with  a  range  of  methods  for  estimating  an  agent’s  action-value  function, 
we  must  now  consider  the  second  major  operation  in  the  policy  iteration  algorithm, 
policy  improvement.  A  straightforward  approach  is  to  update  the  policy  only  after 
an accurate estimate of $Q_f$ has been obtained. The trouble with this approach is
that it may take a long time to estimate $Q_f$. In the meantime, the agent’s policy
will continue to be suboptimal and fail to reflect the information gained since the
last update. In Q-learning, the agent’s policy is updated after each time step
so  that  it  reflects  the  most  current  estimate  of  the  action-value  function  and  the 
experience  gained  so  far. 

A particularly simple version of the Q-learning algorithm is shown in Figure 3.6.
The first step in this algorithm is to initialize the agent’s action-value function. If
prior knowledge of the task is available, it may be possible to choose initial values
that positively bias the action-value function. Otherwise, random or uniform
initial  values  will  do.  Next,  the  agent  enters  the  main  control  cycle.  The  first 
step  in  the  control  cycle  is  to  determine  the  current  state  of  the  world.^  Next, 
the  agent  evaluates  its  policy  function  for  the  current  state  to  obtain  an  estimate 
of  the  optimal  control  action.  Most  of  the  time  the  agent  executes  this  action, 
but  occasionally  an  action  is  chosen  at  random.  Random  actions  help  to  ensure 
experimentation  with  all  state-action  pairs.  The  general  question  of  how  to  trade 
off  exploration  (acting  to  gain  information  or  experience)  against  exploitation 
(acting to gain reward) is difficult to answer and has a long history in game theory,
mathematical statistics, and optimal control [Robbins, 1952; Bradt et al., 1956;
Bradt and Karlin, 1956; Feldaman, 1962; Dubins and Savage, 1965; Epstein, 1967;
Holland, 1973; Holland, 1975; Kaelbling, 1990; Hartman, 1990]. The algorithm in
Figure  3.6  takes  a  particularly  simple  approach,  choosing  on  each  step  a  random 
action  with  a  fixed  probability  p.  A  slightly  more  sophisticated  approach  is  to 
choose  actions  according  to  a  Boltzmann  distribution,  where  the  probability  of 
selecting action $a$ in state $x$ is given by

\[ P(a \mid x) = \frac{e^{Q(x,a)/T}}{\sum_{b \in A} e^{Q(x,b)/T}} \qquad (3.38) \]

and where the temperature parameter $T$ is slowly decreased over time in order
to decrease exploration as experience accumulates. After performing the selected
action, the agent observes the next state, $y$, and the reward, $r$, obtained. These
observations are used to construct a corrected 1-step estimate for $Q(x_t, a_t)$,

\[ r^{(1)} = r + \gamma U(y). \qquad (3.39) \]
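
Boltzmann action selection can be sketched as follows. The function names and the sample action values are illustrative assumptions. Subtracting the largest action value before exponentiating is a standard numerical safeguard added here; it leaves the resulting probabilities unchanged.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature):
    """Map {action: Q(x, action)} to selection probabilities at temperature T."""
    top = max(q_values.values())
    # Shifting by the maximum avoids overflow and cancels in the ratio.
    weights = {a: math.exp((q - top) / temperature) for a, q in q_values.items()}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}


def select_action(q_values, temperature, rng):
    """Draw one action according to the Boltzmann distribution."""
    probs = boltzmann_probabilities(q_values, temperature)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions])[0]


q_x = {"a": 1.0, "b": 2.0}                     # hypothetical action values in state x
hot = boltzmann_probabilities(q_x, 100.0)      # high T: close to uniform
cold = boltzmann_probabilities(q_x, 0.01)      # low T: close to greedy
choice = select_action(q_x, 1.0, random.Random(0))
```

At a high temperature the two actions are chosen almost equally often, while at a low temperature nearly all of the probability falls on the action with the larger value, which is why lowering $T$ over time reduces exploration.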

The  1-step  estimator  is  particularly  convenient  since  it  is  immediately  available. 
However,  more  elaborate  multi-step  estimators  can  also  be  used.  Next  (Step  5), 
the  action-value  for  the  current  state-action  pair  is  updated  using  the  rule: 

\[ Q(x,a) := (1 - \alpha) Q(x,a) + \alpha [r + \gamma U(y)] \qquad (3.40) \]

where $\alpha$ is a learning rate parameter ranging between 0 and 1. This updating rule
implements  a  temporally  weighted  average,  where  the  recent  estimates  receive 

^Typically, it is assumed that the state is defined in terms of the agent’s current sensory
inputs. However, this is not an intrinsic part of the theory.

$Q$ := a set of initial values for the action-value function (e.g., uniformly zero)
$f(x)$ := $a$ such that $Q(x,a) = \max_{b \in A} Q(x,b)$

Repeat forever:

1) $x$ := the current state

3) Select an action $a$ to execute that is usually consistent with $f$
but occasionally an alternate. For example, one might choose to
follow $f$ with probability $p$ and choose a random action otherwise.

4) Execute action $a$, and let $y$ be the next state and $r$ be the reward
received.

5) Update $Q(x,a)$, the action-value estimate for the state-action pair $(x,a)$:

$Q(x,a) \leftarrow (1 - \alpha) Q(x,a) + \alpha [r + \gamma U(y)]$

where $U(y) = Q(y, f(y))$.

6) Update the policy $f$:

$f(x)$ := $a$ such that $Q(x,a) = \max_{b \in A} Q(x,b)$


Figure 3.6: A simple version of the 1-step Q-learning algorithm.

more  weight  than  old  estimates.  Temporally  sliding  estimates  of  this  type  are 
appropriate  for  adaptive  control  in  general  since  the  task  may  be  non-stationary. 
However,  they  are  especially  appropriate  for  Q-learning  since  early  action-value 
estimates  may  have  large  errors  compared  to  later  ones.  Finally  (Step  6),  the 
agent’s  policy  is  updated  and  a  new  cycle  begins. 
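
The control cycle described above can be sketched as a short program. The four-cell corridor environment, the constant learning rate, the parameter values, and the random tie-breaking in the policy update are all assumptions made for illustration. Here `P_FOLLOW` is the probability of following $f$, as in Figure 3.6.

```python
import random

STATES = range(4)
ACTIONS = ("left", "right")
GAMMA, ALPHA, P_FOLLOW = 0.9, 0.1, 0.9


def environment(x, a):
    """Hypothetical corridor: moving right from the last cell pays 1 and restarts."""
    if a == "right":
        return (0, 1.0) if x == 3 else (x + 1, 0.0)
    return (max(x - 1, 0), 0.0)


def greedy(q, x, rng):
    """An action maximizing Q(x, .); ties are broken at random."""
    best = max(q[x, b] for b in ACTIONS)
    return rng.choice([b for b in ACTIONS if q[x, b] == best])


def q_learning(steps, rng):
    q = {(x, a): 0.0 for x in STATES for a in ACTIONS}   # uniformly zero initial values
    f = {x: greedy(q, x, rng) for x in STATES}           # initial policy
    x = 0                                                # Step 1: current state
    for _ in range(steps):
        # Step 3: usually follow f, occasionally pick a random action.
        a = f[x] if rng.random() < P_FOLLOW else rng.choice(ACTIONS)
        # Step 4: execute the action, observe next state y and reward r.
        y, r = environment(x, a)
        # Step 5: blend the old value with the corrected 1-step estimate.
        u_y = q[y, f[y]]
        q[x, a] = (1 - ALPHA) * q[x, a] + ALPHA * (r + GAMMA * u_y)
        # Step 6: update the policy for the state just visited.
        f[x] = greedy(q, x, rng)
        x = y                                            # Step 1 of the next cycle
    return q, f


q, f = q_learning(20000, random.Random(0))
```

In this corridor the only reward lies at the right end, so the learned policy should select `right` in every cell. A constant learning rate is used for brevity and suffices in this deterministic example; it is not meant to reflect the parameter settings of the experiments.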

3.3.2  The  Convergence  of  Q-Learning 

Even  though  Q-learning  is  based  on  dynamic  programming,  it  is  not  clear  that 
it  will  necessarily  learn  an  optimal  policy.  Indeed,  in  the  general  case  (i.e.,  when 
using multi-step estimators), the convergence question remains open. However, it
has been shown that under appropriate conditions 1-step Q-learning is guaranteed
to converge on an optimal policy [Watkins and Dayan, 1992]. A set of conditions
sufficient  to  guarantee  convergence  to  an  optimal  policy  follows: 

1.  the  underlying  decision  process  is  Markovian, 

2. the corrected 1-step estimator is used to update $Q$,

3. each state-action pair is tried infinitely often,

4. the learning rate parameter $\alpha$ is varied with time so that $\alpha_n$, the learning
rate at time $n$,

(a) is positive and monotonically decreasing to zero,

(b) the sum of $\alpha_n$ as $n \to \infty$ diverges, and

(c) the sum of $[\alpha_n]^2$ as $n \to \infty$ is finite.


4 The Block Stacking Testbed


We  are  now  in  a  position  to  address  the  central  topic  of  this  dissertation,  namely, 
the  integration  of  active  perception  (for  efficient  task-oriented  perception)  and 
reinforcement  learning  (for  adaptive,  non-model  based  control).  This  chapter 
describes  the  specific  block  manipulation  task  used  in  the  experiments  and  also 
presents  a  formal  mathematical  model  for  describing  the  control  tasks  facing  an 
embedded  learning  system  that  interacts  with  the  world  through  an  active  visuo- 
motor  system. 


4.1  The  Block  Manipulation  Task 

The  external  part  of  the  block  manipulation  task  is  quite  similar  to  the  block 
stacking  tasks  studied  in  classical  planning  [Nilsson,  1980;  Sussman,  1975].  The 
block manipulation task involves a simulated robot working on a simulated assembly
line, as shown in Figure 4.1. The robot’s job is to arrange piles of blocks into
certain  desirable  (or  goal)  configurations  before  they  fall  off  the  end  of  a  conveyor. 
If  the  robot  manages  to  configure  the  blocks  properly,  it  receives  a  positive  reward, 
the  pile  disappears,  and  a  new  pile  appears  at  the  head  of  the  conveyor.  If  after 
steps  the  robot  fails  to  configure  the  pile,  the  blocks  fall  off  the  conveyor,  the 
trial  ends,  and  a  new  pile  appears  at  the  head  of  the  conveyor.  The  piles  coming 
down  the  conveyor  vary  in  the  number  of  blocks  they  contain,  in  the  properties  of 
the  blocks,  and  in  the  initial  arrangement  of  the  blocks.  However,  only  one  pile 
is  on  the  conveyor  at  a  time,  and  it  is  possible  to  properly  configure  each  pile  in 
the  allotted  time.  In  general,  there  are  no  restrictions  on  the  number  of  blocks 
that can be in a pile; however, in our experiments we limited piles to 20 or fewer
blocks.

The robot’s objective is to accumulate as much reward as possible, or in objective
terms, to configure piles as quickly as possible.

Blocks have various features that differentiate them from one another and that
are used to define “proper configurations.” For example, one can imagine blocks
coming in three sizes (small, medium, and large) and three colors (red, green,
and blue) and a “proper configuration” being defined as
a-small-green-block-on-a-large-red-block.

Figure 4.1: A sketch of the block manipulation task. Key properties of this task
are 1) the robot has a limited amount of time to configure a pile of blocks properly,
2) piles vary considerably from trial to trial and may contain an arbitrary number
of blocks, 3) the robot has no a priori knowledge of the task, but learns it by
receiving a reward upon the successful completion of a trial, and 4) in addition
to learning overt physical actions, the robot must also learn to control an active,
but limited, sensory system.

We  assume  that  block  dynamics  are  similar  to  those  typically  used  in  classical 
planning.  That  is, 

1.  The  conveyor  (or  table)  is  large  enough  to  support  an  infinite  number  of 
blocks. 

2.  A  pile  consists  of  a  series  of  stacks,  where  each  stack  is  defined  to  be  a 
column  of  one  or  more  blocks,  and  each  block  can  directly  support  or  be 
directly  supported  by  at  most  one  block. 

3.  The  robot  has  a  single  manipulator,  which  it  can  use  to  move  one  block  at 
a  time. 

4.  Only  the  robot  moves  blocks,  and  the  robot's  actions  have  deterministic 
effects.* 

5.  A  block  can  be  picked  up  only  if  it  does  not  support  another  block,  and  a 
block  can  be  placed  only  on  the  table  or  on  a  clear  block.  Whole  stacks 
cannot  be  moved,  nor  can  blocks  be  fitted  into  the  middle  of  a  stack. 
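
The block dynamics listed above can be captured in a small model. The class name `Pile`, its methods, and the use of color names as block labels are assumptions of this sketch, which is not the simulator used in the experiments.

```python
class Pile:
    """A pile is a list of stacks; each stack lists block colors bottom to top."""

    def __init__(self, stacks):
        self.stacks = [list(s) for s in stacks if s]
        self.holding = None                      # the single manipulator

    def pick_up(self, stack_index):
        """Pick up the top block of a stack; only a clear block can be grasped."""
        if self.holding is not None:
            return False                         # the manipulator is already busy
        if not 0 <= stack_index < len(self.stacks):
            return False
        self.holding = self.stacks[stack_index].pop()
        if not self.stacks[stack_index]:
            del self.stacks[stack_index]         # an emptied stack disappears
        return True

    def place(self, stack_index=None):
        """Place the held block on the table (None) or on top of a stack."""
        if self.holding is None:
            return False
        if stack_index is None:
            self.stacks.append([self.holding])   # the table always has room
        elif 0 <= stack_index < len(self.stacks):
            self.stacks[stack_index].append(self.holding)
        else:
            return False
        self.holding = None
        return True


# Uncover and grasp the bottom block of a three-block stack.
pile = Pile([["green", "red", "blue"]])
pile.pick_up(0)                 # grasp blue, the only clear block
blocked = pile.pick_up(0)       # fails: one block at a time
pile.place()                    # blue goes to the table
pile.pick_up(0)                 # grasp red
pile.place()                    # red goes to the table
pile.pick_up(0)                 # green is now clear and can be grasped
```

Because only the top of a stack is ever accessible, the rule that whole stacks cannot be moved and that blocks cannot be inserted mid-stack holds by construction.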

Attention  is  focused  on  one  particularly  simple  block  manipulation  task,  which 
we will call the GB-task (for GREEN-BLOCK-task). In this task, blocks come in
three colors: red, green, and blue, and the agent receives a positive reward, $R_g$,
whenever  it  manages  to  pick  up  a  green  block.  That  is,  goal  states  consist  of  those 
configurations  of  the  physical  world  in  which  the  robot  is  holding  a  green  block. 
We  also  assume  that  each  pile  contains  exactly  one  green  block.*  Even  though  this 
task  may  seem  simple,  it  is  important  to  realize  that  it  is  not  completely  trivial 
since  the  robot  may  need  to  unstack  a  considerable  number  of  blocks  before  it 
can uncover the green block. Also, since the robot is to learn the task, its sensors
and effectors are initially uninterpreted and it has no a priori knowledge of the
task.  Finally,  keep  in  mind  that  the  robot  has  a  sensory-motor  system  that  is 
flexible,  but  limited  in  the  amount  of  information  it  provides  the  decision  system. 

'As we will see in later chapters, for certain instantiations of the CR-method this assumption
can be relaxed.

*The  reason  for  this  assumption  will  be  made  clear  in  the  coming  chapters,  when  we  describe 
an  approach  for  dealing  with  multiple  green  blocks.  In  the  meantime,  however,  the  intuitive 
reason  for  the  restriction  is  to  simplify  the  amount  of  sensory  processing  required  by  the  robot. 
In  particular,  if  multiple  green  blocks  exist,  then  the  robot  (if  it  is  to  behave  optimally)  must 
analyze  and  compare  the  utility  of  strategies  for  pursuing  each  green  block.  While  this  selection 
process  is  important,  it  is  a  complication  that  gets  in  the  way  of  studying  more  basic  issues,  so 
it is ignored for now.

In addition to learning to perform the correct overt actions, the robot must also
learn  to  control  its  sensory  system  in  order  to  generate  an  adequate,  task-specific 
representation. 


4.2  The  Sensory-Motor  System 

The sensory-motor system is considerably different from the objective representations
traditionally used. The robot is equipped with an active sensory-motor
system  that  provides  selective,  but  limited,  access  to  the  external  world.  Our 
contention  is  that  many  problems  of  interest  only  require  keeping  track  of  a  few 
objects  at  a  time  (for  example,  see  [Chapman,  1989]). 

The  specification  for  the  embedded  decision  system’s  interface  to  the  sensory- 
motor  system  is  shown  in  Figure  4.2.  On  the  sensory  side,  the  system  at  each 
point  in  time  generates  a  20-bit  binary  input  vector,  which  defines  the  state  of 
the  agent’s  internal  representation.  The  information  encoded  in  the  input  vector 
falls  into  three  general  categories:  peripheral  aspects,  local  aspects,  and  relational 
aspects.  In  general,  peripheral  aspects  register  spatially  non-specific  information 
about  the  world,  such  as  the  presence  or  absence  of  indexical  properties  (e.g., 
colors,  shapes,  and  textures).  Our  robot’s  sensory-motor  system  has  3  peripheral 
bits  that  register  the  presence  of  red,  green,  and  blue  in  the  scene,  and  1  bit 
that  registers  whether  or  not  the  robot  is  holding  an  object.  Both  local  and 
relational aspects register properties of objects that are marked.^ Local aspects
register  intrinsic  local  features  of  an  object,  such  as  its  shape,  color,  orientation, 
and  texture.  Relational  aspects  register  relational  properties  between  marked 
objects,  such  as  relative  shape,  relative  color,  and  relative  position.  The  system 
in  our  experiments  has  two  markers,  called  the  action-frame  and  the  attention- 
frame,  respectively.  The  local  aspects  for  these  markers  define  14  of  the  20  bits 
in the input vector. Local aspects register the color of the marked object
(2-bits/marker), the shape of the object (1-bit/marker), whether or not the marked
object is currently being held (1-bit/marker) and the number of blocks on top
of the marked object (2-bits/marker). The robot’s sensory system registers two
relational  aspects  —  one  for  recording  vertical  alignment  between  markers  (1-bit) 
and  one  for  recording  horizontal  alignment  (1-bit). 

The internal motor commands supported by the sensory-motor system are
shown on the right in Figure 4.2. Two types of internal motor commands are
distinguished: overt actions and perceptual actions. Overt actions are commands
that change the state of the external world with respect to the task. This can
occur either by performing a physical action that manipulates some object in the
physical world (e.g., picking up a block) or by changing the configuration in the
sensory-motor system in a way that changes the robot’s ability to manipulate
objects in the external world (e.g., moving a marker that establishes the reference
frame used by other overt actions). Perceptual actions are commands that change
the agent’s perception of the world but do not affect the external world or the
agent’s ability to interact with it. The robot has a total of 14 internal motor
commands: 8 are overt, 6 are perceptual.

^These bits correspond to aspects in Agre and Chapman’s theory [Agre, 1988; Chapman,
1990b].

Figure 4.2: A specification for the block stacking robot’s sensory-motor system.
The system has two markers, an action-frame marker and an attention-frame
marker. The system has a 20-bit input vector, 8 overt actions, and 6 perceptual
actions. The values registered in the input vector and the effects of internal
action commands depend upon the bindings between markers in the sensory-motor
system and objects in the external world.

All  overt  actions  are  associated  with  the  action-frame  marker.  In  addition 
to  providing  perceptual  information,  this  marker  is  also  used  to  establish  the 
reference  frames  for  manipulating  objects  in  the  world.  The  two  primary  overt 
actions are for grasping and for placing objects. For grasping, the action
grasp-object-at-action-frame causes the robot to pick up the object marked by the
action-frame marker. The action succeeds if the robot’s hand is empty and the marked
object  has  a  clear  top.  For  placing,  the  action  place-object-at-action-frame  causes 
the  system  to  place  a  block  it  is  holding  on  top  of  the  object  pointed  to  by  the 
action-frame  marker.  This  action  works  if  the  robot  is  holding  a  block  and  the 
target  object  has  a  clear  top.  The  remaining  overt  actions  are  used  to  move  the 
action-frame  to  different  spatial  locations.  These  commands  allow  the  agent  to 
index  functionally  relevant  objects  by  primitive  indexical  properties  (color  and 
shape)  or  by  spatial  relationships  (top-of-stack  and  bottom-of-stack).  Although 
these  indexing  actions  do  not  change  the  external  world  directly  (i.e.,  no  blocks  get 
moved),  they  are  overt  actions  in  the  strictest  sense  because  they  affect  the  robot’s 
ability  to  manipulate  the  external  world.  Moving  the  action  frame  changes  the 
effects  of  grasp  and  place  actions.  Once  a  marker  is  placed  on  an  object,  it  tracks 
the  object  until  the  binding  is  explicitly  broken  with  another  indexing  command. 
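
The way the action-frame marker mediates overt actions can be sketched as follows. The class `World`, the unique block labels such as `"green-1"`, and the two `move_action_frame_*` method names are hypothetical; only the grasp and place commands are named in the text. The labels are internal bookkeeping for the simulation; the agent itself refers to blocks only through the marker.

```python
class World:
    def __init__(self, stacks):
        self.stacks = [list(s) for s in stacks]   # each stack: bottom -> top
        self.holding = None
        self.action_frame = None                  # block the marker is bound to

    def _stack_of(self, block):
        for stack in self.stacks:
            if block in stack:
                return stack
        return None                               # unbound, or block is in the hand

    def move_action_frame_to_color(self, color):
        """Index by a primitive property: bind the marker to a block of this color."""
        for stack in self.stacks:
            for block in stack:
                if block.startswith(color):
                    self.action_frame = block
                    return True
        return False

    def move_action_frame_to_stack_top(self):
        """Index by a spatial relation: move the marker to the top of its stack."""
        stack = self._stack_of(self.action_frame)
        if stack is None:
            return False
        self.action_frame = stack[-1]
        return True

    def grasp_object_at_action_frame(self):
        """Succeeds only if the hand is empty and the marked block is clear."""
        stack = self._stack_of(self.action_frame)
        if self.holding is not None or stack is None or stack[-1] != self.action_frame:
            return False
        self.holding = stack.pop()
        return True

    def place_object_at_action_frame(self):
        """Succeeds only if a block is held and the marked block is clear."""
        stack = self._stack_of(self.action_frame)
        if self.holding is None or stack is None or stack[-1] != self.action_frame:
            return False
        stack.append(self.holding)
        self.holding = None
        return True


world = World([["red-1"], ["green-1", "blue-1"]])
world.move_action_frame_to_color("green")
first_try = world.grasp_object_at_action_frame()   # green is covered by blue
world.move_action_frame_to_stack_top()             # marker now on blue
world.grasp_object_at_action_frame()               # pick up blue
world.move_action_frame_to_color("red")
world.place_object_at_action_frame()               # put blue on red
world.move_action_frame_to_color("green")
second_try = world.grasp_object_at_action_frame()  # green is now clear
```

The same grasp command fails or succeeds depending on where the marker sits and what covers the marked block, which illustrates why moving the action frame counts as an overt action even though no block moves.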

A repertoire of perceptual actions is associated with the attention-frame
marker. These actions are used exclusively for gathering additional sensory information.
No overt actions depend on the placement of the attention-frame marker.
As for the action-frame, indexical properties are used to guide the placement of the
attention-frame. As will be seen in Chapter 6, the attention-frame marker plays
an  important  role  in  allowing  the  system  to  disambiguate  functionally  different 
situations  in  the  world. 

The  robot’s  sensory-motor  system  was  directly  motivated  by  Ullman’s  visual 
routines  model  and  Agre  and  Chapman’s  work  on  deictic  representations;  and 
although  it  is  a  simplification,  it  embodies  many  of  the  essential  ideas  of  the 
active-perception  paradigm.  In  particular,  the  agent’s  internal  representation  is 
flexible,  but  limited  in  scope.  This  follows  since  the  internal  representation  is 
almost  completely  defined  in  terms  of  the  action  and  attention  frame  markers 
(or processing foci), which are actively controlled by a higher level decision component.
One feature that may appear to be missing is the ability to assemble
complex visual routines from elemental operations. However, this is not the case
since in general visual routines are assembled and dispatched by higher level decision
components. In our robot, “visual routines” emerge as the decision system
learns  to  control  the  sensory-motor  system  in  order  to  gain  needed  information. 
Admittedly,  some  of  our  “primitive  aspects”  are  unrealistically  complex  and  would 
probably  be  implemented  using  visual  routines  in  a  more  realistic  system. 

Notice  that  the  internal  state  space  defined  by  the  sensory  inputs  is  small 
compared  to  the  state  space  that  could  result  if  every  object  in  the  domain  were 
represented objectively [Swain, 1990]. The principal advantage of this reduced
internal representation is that it leads to more feasible perception and a simpler
decision task. For example, a pile of 50 blocks would require 100 bits and a state
space of at least $3^{50}$ (or $7.1 \times 10^{23}$ states) just to encode the color of each block. Our
robot’s input vector is limited to 20 bits (or about a million states). The principal
disadvantage  of  the  reduced  representation  is  that  it  limits  the  complexity  of  the 
problems  that  can  be  solved  by  the  agent.  For  example,  if  during  the  course  of  a 
problem,  a  decision  depends  upon  features  of  three  separate  blocks,  then  the  robot 
will  not  be  capable  of  solving  the  problem  since  it  cannot  simultaneously  represent 
features  of  more  than  two  blocks.  Of  course,  some  sort  of  memory  mechanism 
could  be  added  or  the  sensory-motor  system  could  be  expanded  to  allow  the  system 
to  register  more  information  (for  example,  by  adding  an  additional  marker),  but 
in  general  new  problems  can  always  be  defined  that  are  beyond  the  scope  of  the 
internal  representation. 
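
The size comparison above can be checked with a few lines of arithmetic. This is a minimal sketch; the constant names are illustrative, and the choice of 2 bits per block assumes three colors (red, green, blue) coded in the smallest whole number of bits.

```python
# Compare an objective color encoding with the robot's 20-bit input vector.
NUM_BLOCKS = 50          # pile size used in the example in the text
NUM_COLORS = 3           # red, green, blue
BITS_PER_BLOCK = 2       # assumption: smallest whole number of bits for 3 colors
INPUT_VECTOR_BITS = 20   # width of the robot's input vector

objective_bits = NUM_BLOCKS * BITS_PER_BLOCK      # bits to store every color
objective_states = NUM_COLORS ** NUM_BLOCKS       # distinct color assignments
internal_states = 2 ** INPUT_VECTOR_BITS          # distinct input vectors

print(objective_bits)                 # 100
print(f"{objective_states:.2e}")      # 7.18e+23
print(internal_states)                # 1048576
```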

Also, notice that individual objects in the world are referenced not by arbitrarily
assigned names, but by the features that make them relevant. For example,
the action Move-action-marker-to-stack-top would cause the action-frame to move
upwards from its current position until it reaches the block at the top of the stack.
What makes this top block significant is not any absolute name like “BLOCK-43,”
but the relationship it holds with the rest of the world. Namely, this block is at
the top of a stack and affords [Gibson, 1979] being removed and placed on the
table (possibly to get at another more important block) [Agre, 1988]. The variety
of  features  and  properties  that  can  be  used  as  indexicals  also  delimits  the  types 
of  problems  that  an  agent  can  solve. 

Finally,  notice  that  physical  actions  in  the  world  (e.g.,  picking  and  placing 
blocks)  are  performed  relative  to  reference  frames  established  by  the  action-frame. 
This  is  consistent  with  the  view  that  objects  in  the  world  fill  roles  according  to 
their features and that the control strategy learned by the decision system should
be  specified  in  terms  of  those  abstract  roles  [Agre,  1988;  Ballard,  1989b]. 

Our  objective  is  to  develop  adaptive  control  systems  that  can  learn  to  solve 
the  GB-task.  This  objective  raises  an  interesting  question.  Does  such  a  control 
strategy  exist?  That  is,  given  the  limited  capabilities  of  the  robot’s  sensory  system, 
does  a  stationary  decision  policy  (i.e.,  a  fixed  mapping  from  internal  states  to 
action  commands)  exist  that  solves  the  task?  For  the  GB-task,  the  answer  to  this 
question is “yes”. Figure 4.3 shows a list of condition-action rules that define
such a policy. The GB-task can be solved by 1) keeping the robot’s hand clear,
so that it can pick up blocks as needed, 2) using the attention-frame to locate
the green block, 3) when not involved in placing an object, moving the action-frame
to the top of the stack containing the green block, and 4) picking up any clear block
from the green stack, either to decrease the stack height or to pick up the green
block.

If   1) the hand is not empty and
     2) the action-frame is not on the table,
then move the action-frame to the table.

If   1) the hand is not empty and
     2) the action-frame is on the table,
then place the held object at the location marked by the action-frame.

If   1) the hand is empty and
     2) the attention-frame is not on a green block,
then move the attention-frame to the green block.

If   1) the hand is empty,
     2) the attention-frame is on a green block, and
     3) the attention-frame and the action-frame are not vertically aligned,
then move the action-frame to the green block.

If   1) the hand is empty,
     2) the attention-frame is on a green block,
     3) the attention-frame and the action-frame are vertically aligned, and
     4) the object marked by the action-frame has a clear top,
then pick up the object marked by the action-frame.

If   1) the hand is empty,
     2) the attention-frame is on a green block,
     3) the attention-frame and the action-frame are vertically aligned, and
     4) the object marked by the action-frame is not clear,
then move the action-frame to the top of the stack.

Figure 4.3: A fixed set of rules that reliably solve the GB-task. Depending upon
the distribution of problem instances (piles), this strategy may or may not be
optimal. In any case, it is nearly optimal.
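
The rules of Figure 4.3 amount to a stationary policy: a fixed function from what the robot currently perceives to an internal command. The sketch below is one way to write that function down. The `Percept` fields and most of the command strings are hypothetical names chosen for illustration; only grasp-object-at-action-frame and move-action-frame-to-green appear as command names in the text, and the real input vector of Figure 4.2 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Percept:
    """Hypothetical summary of the features tested by the rules in Figure 4.3."""
    hand_empty: bool
    action_frame_on_table: bool
    attention_on_green: bool
    frames_aligned: bool
    action_object_clear: bool


def gb_policy(p: Percept) -> str:
    """Return the command selected by the six condition-action rules."""
    if not p.hand_empty:
        # Rules 1 and 2: get rid of whatever is being held.
        if not p.action_frame_on_table:
            return "move-action-frame-to-table"
        return "place-object-at-action-frame"
    if not p.attention_on_green:
        return "move-attention-frame-to-green"      # rule 3
    if not p.frames_aligned:
        return "move-action-frame-to-green"         # rule 4
    if p.action_object_clear:
        return "grasp-object-at-action-frame"       # rule 5
    return "move-action-frame-to-stack-top"         # rule 6


# Example: empty hand, both markers on the green stack, marked block is covered.
print(gb_policy(Percept(True, False, True, True, False)))
```

Because the rules are tested in a fixed order and their conditions are mutually exclusive, exactly one command is returned for every percept.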

4.3  Perceptual  Mappings 

Let us study the properties of an active perceptual system in more detail.
Perception can be defined as the process of mapping situations in the world onto states
in an agent’s internal representation. Following this definition in the most
general sense, the perceptual system can include all the sensory and computational
processes that provide information to the internal representation. This could
include short term memory processes used to maintain and recall information about
previous events. However, we will restrict ourselves to agents whose internal
representations are defined solely in terms of their immediate sensory inputs. In
this  case,  the  agent’s  sensory  system  performs  the  mapping  from  situations  in  the 
world  to  internal  states. 

Since  in  general  there  are  an  unbounded  number  of  different  situations  in  the 
world  (i.e.,  every  moment  is  unique  in  some  respect),  and  we  are  concerned  with 
systems  that  have  only  a  finite  number  of  internal  states,  some  internal  states  must 
necessarily  represent  multiple  world  states.  We  call  this  overloading  of  internal 
states perceptual reduction. Perceptual reduction is fundamental to perception;
it cannot be avoided. However, it can be helpful or it can be a hindrance. In
particular, if the perceptual mapping is chosen correctly, then each internal state
will represent situations that are functionally equivalent. Conversely, if the
mapping is chosen incorrectly, then the equivalence class associated with some internal
states may contain situations that are functionally dissimilar. Under these
circumstances, the internal state may say nothing useful about the current situation
with respect to the task. Agre has called the correct or useful overloading of
internal states passive abstraction [Agre, 1988]. We will adopt this nomenclature
and use the term perceptual aliasing to denote the incorrect (or unproductive)
overloading  of  internal  states. 

An  example  of  passive  abstraction  for  the  GB-task  is  shown  in  Figure  4.4.  In 
this  case,  two  different  piles  of  blocks  in  the  world,  due  to  careful  placement  of 
the  action  and  attention  frames,  generate  the  same  internal  representation.  The 
equivalence  class  associated  with  this  internal  state  consists  of  those  situations 
where: 

1. the hand is empty,

2. a green block is covered by a red block that itself is clear (there may be
other blocks between the red and the green ones),

3. the action-frame marker is on the red block,

4. there are red, green, and blue blocks in the pile.

Internal representation: 11100000000010011010

Figure 4.4: An example of passive abstraction in the GB-task. In this case, both
world states share the same optimal action. In the figure the (+) represents the
location of the action-frame marker and the (*) represents the location of the
attention-frame marker.

With respect to the GB-task, whenever this internal state is encountered the
optimal action to perform is grasp-object-at-action-frame. That is, situations
represented by this particular sensory input vector are functionally equivalent with
respect to the GB-task.

Notice  how  information  about  the  other  irrelevant  blocks  in  the  pile  is  not 
encoded in the internal representation. Most of the irrelevant information (e.g.,
information about stacks other than the one containing the green block) is
abstracted out of the representation automatically. Nevertheless, this internal state
does  encode  some  irrelevant  information  —  in  particular,  the  fact  that  the  top 
block  is  red  and  that  there  are  blue,  green,  and  red  colors  detected  in  the  scene  is 
irrelevant. 

Intuitively, an internal state is most useful when

1. the equivalence class associated with the state is as large as possible (i.e.,
the representation does not make irrelevant distinctions), and

2. all the situations in the world represented by a state are functionally
equivalent in terms of the actions required for optimal control.

Panels: World state 1; World state 2.

Internal representation: 11100000000100000000

Figure 4.5: An example of perceptual aliasing from the GB-task. In this case two
world states with different optimal actions generate the same internal
representation.

An example of perceptual aliasing is shown in Figure 4.5. In this case, marker
placements are such that two functionally different situations generate the same
input vector. The optimal action for the pile on the left is to move the action-frame
to the green block (move-action-frame-to-green), whereas the optimal action for
the pile on the right is to pick up the marked red block
(grasp-object-at-action-frame). The trouble with this internal state is that, given only this information,
the  decision  system  cannot  distinguish  between  these  two  different  cases  and  so 
cannot  be  guaranteed  to  select  the  optimal  action. 

In systems with fixed (or passive) sensory systems, the burden of choosing
an appropriate internal representation (and perceptual function) is placed on the
system’s designer. The objective representations commonly used in AI are passive
representations. In this case, every potentially relevant object is identified,
named, and objectively represented. Also, they are typically intended to be
general purpose so that the robot can perform a range of tasks in the domain.
Objective representations tend to be bad representations, not so much because they
confuse situations that are functionally different, but because they differentiate
between situations that are functionally equivalent. In objective representations,
each internal state encodes too much information for most tasks. This complicates
decision making by forcing the agent to do its own abstraction.

Figure 4.6: Generally the mapping between external world states and the agent’s
internal representation is many-to-many; a) shows how two different situations
can generate the same internal state and b) shows how one situation may have
more than one internal representation.


To obtain a more appropriate representation, it is necessary to consider the
specific task to be performed by the robot and the functional equivalence classes
it  entails.  This  can  be  a  problem  if  the  task  is  initially  unknown  or  changes 
over  time.  However,  active  perception  can  address  this  problem  by  providing  an 
efficient  means  for  implementing  a  wide  range  of  perceptual  mappings.  Of  course, 
in  the  case  of  active  perception,  adaptive  control  involves  the  learning  of  both 
the  overt  actions  needed  to  perform  the  task  and  the  perceptual  actions  needed 
to  sense  and  represent  the  world  properly  with  respect  to  the  task.  Examples  of 
the  many-to-many  mappings  afforded  by  the  block  stacking  robot’s  sensory-motor 
system  are  shown  in  Figure  4.6. 


4.4  A  Formal  Model  of  Embedded  Learning 

Let us now formalize concepts such as “the world,” “the agent,” “the
sensory-motor system” and the “decision system” in a general model. The model, shown
in Figure 4.7, extends a model proposed by Kaelbling [Kaelbling, 1989] by
explicitly representing the dynamic relationship between external world states and the
agent’s  internal  representation.  Given  this  model  we  can  formally  describe  the 
decision  problem  facing  the  embedded  controller. 

4.4.1  The  External  World 

The external world is modeled as a discrete time, discrete state, Markov decision
process and is described by the tuple (S_E, A_E, T_E, R_E). The ‘E’ subscript is used
to emphasize that this is a model of the environment external to the agent. The
model is a mathematical abstraction of the physical world that collapses the
infinite complexity of the real world onto a finite model. However, the model is
assumed to capture the essence of the task to be performed by the agent (or put
another way, the Markov model defines the task to be performed). Of course,
the model is just a mathematical abstraction, and the agent (and especially the
embedded decision system) has no knowledge of it.


4.4.2  The  Agent 

Our  model  of  the  agent  has  two  major  subsystems:  a  sensory-motor  subsystem  and 
a  decision  subsystem.  The  sensory-motor  subsystem  implements  three  functions: 
1) a perceptual function P; 2) an internal configuration function I; and 3) a motor
function M. The model of the sensory-motor system formalizes the relationships
that exist between 1) internal states and actions and 2) the external world model.
On the sensory side, the system translates world states (i.e., states in the external
model) into the states in the agent’s internal representation. Since perception
is active, this mapping is dynamic and dependent upon the configuration of the
sensory-motor apparatus. Let S_I be the finite set of possible internal states, and C
be the set of sensory-motor configurations. Then, the relationship between world
states and the agent’s internal representation can be described by a perceptual
function P, which maps world states S_E and sensory-motor configurations C onto
internal states S_I (i.e., P : S_E x C -> S_I). Notice that in a real system, the physical
sensory system implements a mapping from the physical world (an infinite number
of real world situations) onto the agent’s internal representation S_I, but in the
model, the perceptual function, for a given configuration, maps states in S_E onto
S_I.

Figure 4.7: A formal model for an agent with an embedded learning system and
an active sensory-motor system. The table summarizes the functions implemented
by each of the model’s components. (Panels: the world, the agent, the decision
system.)

On the motor side, the agent has a finite set of internal motor commands,
A_I, that affect the model in two ways: they can either change the state of the
external world (by being translated into external actions, A_E), or they can change
the configuration of the sensory-motor subsystem. As with perception, the
configuration of the sensory-motor system modulates the effects of internal commands.
This dependence is modeled by the functions M and I, which map internal
commands and sensory-motor configurations into actions in the external world and
into new sensory-motor configurations, respectively (that is, M : A_I x C -> A_E
and I : A_I x C -> C). M is called the motor function and I is called the
configuration function. Internal commands that change the state of the external world
or that change the sensory-motor configuration so as to affect the motor mapping
are called overt actions and are denoted by the set A_O. Commands that change
the configuration of the sensory-motor system, but leave the motor mapping
unchanged, are called perceptual actions and are denoted by the set A_P.

The  other  component  of  the  agent  is  the  decision  subsystem.  This  subsystem 
is  like  a  homunculus  that  sits  inside  the  agent’s  head  and  controls  its  actions. 
On  the  input  side,  the  decision  subsystem  has  access  to  reward  generated  by  the 
external  world  and  to  the  agent’s  internal  representation,  but  not  to  the  state  of 
the  external  world.  Similarly,  on  the  motor  side,  the  decision  subsystem  generates 
internal  action  commands  that  are  interpreted  by  the  sensory-motor  system. 


4.4.3  The  Internal  Decision  Problem 

There  are  two  decision  problems  that  need  to  be  distinguished  at  this  point:  the 
external  decision  problem  and  the  internal  decision  problem.  The  external  decision 
problem  is  defined  by  the  Markov  decision  process  used  to  objectively  describe 
the  external  world  (or  the  external  task).  The  internal  decision  problem  is  the 
control  task  facing  the  embedded  decision  system  and  is  defined  by  the  tuple 
(S_I, A_I, T_I, R_I) where

S_I is the set of possible input values generated by the sensory-motor system,

A_I is the set of internal action commands,

T_I is the internal transition function, which describes the effects of internal action
commands on the next internal state, and

R_I is the internal reward function.

In  this  model,  the  internal  transition  and  reward  functions  in  general  depend 
upon  the  state  of  the  external  world  and  the  configuration  of  the  sensory-motor 
system.  The  internal  transition  function  can  be  expressed  in  terms  of  the  external 
transition  function,  the  perceptual  function,  and  the  motor  function  as  follows: 

T_I(x, a | x_E, c) = P(T_E(x_E, M(a, c)), c)    (4.1)

where T_I(x, a | x_E, c) denotes the result of executing internal action command a in
internal state x, given that the current world state is x_E and the current
sensory-motor configuration is c. Similarly, the internal reward function can be expressed
as

R_I(x, a | x_E, c) = R_E(x_E, M(a, c)).    (4.2)

The  objective  of  the  embedded  decision  system  is  to  learn  a  control  policy  (a 
mapping  from  internal  states  to  internal  motor  commands)  that  maximizes  the 
expected  future  return.  Notice,  however,  that  the  internal  decision  problem  may 
or  may  not  satisfy  the  Markov  property  since  in  general  transitions  and  rewards 
depend upon 1) the state of the external world, 2) the current configuration of the
sensory-motor  function,  and  3)  the  mappings  implemented  by  the  perceptual  and 
motor  functions.  As  will  be  shown  in  the  next  chapter,  active  perception  almost 
invariably  leads  to  decision  problems  that  are  non-Markov. 
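
Equations 4.1 and 4.2 are plain function compositions, which a small sketch can make concrete. Everything below the helper `make_internal_problem` is a made-up toy world used only for illustration (a two-bit external state, a sensor that reads one bit, and a "flip" command); it is not the GB-task. The sketch follows Equation 4.1 as printed, applying P with the current configuration c, and computes the configuration update I(a, c) separately.

```python
def make_internal_problem(T_E, R_E, P, M):
    """Build T_I and R_I from the external model and the sensory-motor functions."""
    def T_I(x_E, a, c):
        # Equation 4.1: next internal state after internal command a.
        return P(T_E(x_E, M(a, c)), c)

    def R_I(x_E, a, c):
        # Equation 4.2: reward received for internal command a.
        return R_E(x_E, M(a, c))

    return T_I, R_I


# --- Hypothetical toy world: external states 0..3 are two-bit numbers. ---
def T_E(x_E, a_E):
    if a_E == "flip0":
        return x_E ^ 1
    if a_E == "flip1":
        return x_E ^ 2
    return x_E                      # "noop" leaves the world unchanged


def R_E(x_E, a_E):
    return 1.0 if T_E(x_E, a_E) == 3 else 0.0


def P(x_E, c):
    return (x_E >> c) & 1           # the sensor reads only bit c


def M(a, c):
    return f"flip{c}" if a == "flip" else "noop"


def I(a, c):
    return {"look0": 0, "look1": 1}.get(a, c)   # perceptual commands move the sensor


T_I, R_I = make_internal_problem(T_E, R_E, P, M)

# External states 0 and 2 look identical under configuration 0 ...
print(P(0, 0), P(2, 0))                         # 0 0
# ... yet the same internal command earns different rewards.
print(R_I(0, "flip", 0), R_I(2, "flip", 0))     # 0.0 1.0
```

The last two lines show in miniature why the internal problem need not be Markov: the reward for a given internal state and command still depends on the hidden external state.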


4.4.4  Modeling  the  GB-task 

The  GB-task  can  be  formalized  using  this  model.  For  the  GB-task,  the  external 
world  can  be  defined  as  a  Markov  process  whose  state  space  consists  of  the  set 
of  all  possible  piles  of  blocks.  In  this  case,  each  state  describes  a  unique  pile  of 
blocks, much as in an objective representation. The set of actions A_E consists
of  actions  for  grasping  and  placing  each  and  every  block.  The  transition  function 
is  deterministic  and  follows  the  standard  dynamics  used  in  block  stacking.  The 
reward  function  is  uniformly  zero  except  for  transitions  that  yield  states  where 
the  robot  was  holding  a  block,  in  which  case  the  reward  is  Rg. 

The agent in the GB-task is defined as follows. The internal state space S_I is
defined by the set of possible values for the input bit vector in Figure 4.2. The set
of internal actions A_I consists of the actions listed on the right in Figure 4.2. The set of
configurations C corresponds to the set of possible placements for the markers on
objects in the external world. The perceptual and motor functions are somewhat
harder  to  write  down,  but  they  define  the  mapping  from  piles  of  blocks  in  the 
world  to  values  of  the  input  vector,  and  from  internal  action  commands  to  actions 
in  the  external  “blocks  world”  model. 


5  Combining Active Perception and Q-Learning


The first program written to learn the GB-task, called Meliora-I, takes a most
straightforward approach by directly applying 1-step Q-learning to the internal
decision problem described in Section 4.4. To our surprise, Meliora-I is unable
to learn to solve the GB-task consistently, and generally performs only slightly
better than random. In this chapter, we describe Meliora-I, document its poor
performance on the GB-task, and analyze its failure. The principal result of the
chapter is to show that agents that use standard Q-learning (and in general
reinforcement learning algorithms that use TD-methods) for the adaptive control
of an active sensory-motor system will almost always fail to learn reliable control
strategies. The poor performance of Q-learning is explained by introducing the
notion of an inconsistent internal state. Informally, inconsistent states are the
result of perceptual aliasing and arise any time the perceptual mapping is such
that two or more states with differing action-values in the external Markov model
get mapped onto the same internal state. Under these circumstances it is not
possible  for  the  agent’s  internal  action-value  function  to  represent  simultaneously 
the  different  action-values  for  the  confounded  external  states.  Consequently,  the 
action-values  for  inconsistent  internal  states  tend  to  reflect  a  sampled  average  of 
the  values  for  the  confounded  external  states.  These  averaged  values  not  only  lead 
to  inaccurate  estimates  for  the  inconsistent  states  themselves,  but  also,  through 
the  use  of  TD-methods,  introduce  errors  in  the  action-value  estimates  for  other 
internal  states.  This  results  in  inaccurate  action-value  estimates  for  the  internal 
decision  problem  and  non-optimal,  indeed  unreliable,  control  policies. 


5.1  Meliora-I 

Meliora-I uses the 1-step Q-learning algorithm described in Figure 3.6. In this
case, the states seen by the controller are the values of the robot’s input
vector and the set of control actions is the robot’s internal motor commands (see
Figure 4.2). In total there are 2^20 (or 1,048,576) internal states and 14
possible actions. In Meliora-I a table is used to implement the action-value function.
The table contains one action-value estimate (an entry) for each state-action pair
for a total of 14,680,064 entries. This tabular approach is a particularly simple
method for implementing the action-value function; however, its simplicity serves
our purposes well. Even though more sophisticated function approximation
techniques that save space and provide generalization can be used (e.g., CMACs,
neural networks, classifier systems), they too suffer from the same fundamental
difficulties caused by active perception. Explicit representation of the action-value
function in the form of a table makes these interactions more apparent and easier
to explain.
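
As a rough illustration of the tabular approach, the sketch below sizes the table and applies the standard 1-step Q-learning update. It is an assumption-laden sketch, not a transcription of Figure 3.6: the function names are invented, the table is allocated lazily with a dictionary rather than as 14,680,064 preallocated entries, and the default step size and discount are simply the values 0.5 and 0.9 used for illustration.

```python
from collections import defaultdict

NUM_STATE_BITS = 20
NUM_ACTIONS = 14
NUM_STATES = 2 ** NUM_STATE_BITS            # 1,048,576 internal states
TABLE_ENTRIES = NUM_STATES * NUM_ACTIONS    # one entry per state-action pair


def make_q_table():
    """Action-value table, initialized to zero; rows are created on first use."""
    return defaultdict(lambda: [0.0] * NUM_ACTIONS)


def q_update(q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Standard 1-step Q-learning backup for the transition (s, a, r, s_next)."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]


q = make_q_table()
print(TABLE_ENTRIES)                     # 14680064
print(q_update(q, 0b01, 3, 1.0, 0b10))   # 0.5
print(q_update(q, 0b00, 2, 0.0, 0b01))   # 0.225
```

The second update shows the temporal-difference character of the rule: the estimate for one state is built from the current estimate of its successor, so an error in one entry can spread to others.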

5.1.1  The  Experiment 

The experiment with Meliora-I proceeded as follows. The experiment was
comprised of a series of runs, in which the robot was sequentially presented with
1000 instances of the GB-task (i.e., 1000 trials). At the beginning of each run,
Meliora-I’s  action-value  function  was  uniformly  initialized  to  zero.  On  each  time 
increment,  the  agent  cycled  once  through  the  control  loop  in  Figure  3.6.  This 
loop  involves  choosing  and  executing  an  action,  observing  the  state  and  reward 
that  result,  and  updating  the  action-value  and  policy  functions. 

Each  instance  (or  trial)  consists  of  a  randomly  configured  pile  of  exactly  4 
blocks with the pile always containing exactly one green block. Randomly
selecting problem instances allows the system to get a good mix of easy and difficult
problems.  An  easy  problem  corresponds  to  one  in  which  all  four  blocks  are  placed 
on  the  table.  In  this  case,  the  robot  need  merely  fixate  and  directly  grasp  the 
green  block.  A  more  difficult  problem  is  one  in  which  the  green  block  is  at  the 
base  of  a  stack  containing  all  four  blocks.  In  this  case,  the  robot  must  sequentially 
unstack  all  three  covering  blocks  before  grasping  the  green  one.  If  in  any  trial  the 
robot fails to solve the problem after a preset number of overt actions, it decides that
the instance is too difficult and moves on to the next trial. Limiting the amount of time
the robot spends on any given problem instance provides a convenient mechanism
for automatically filtering out instances that are far beyond the agent’s
capabilities at a given point in time. This technique prevents the robot from becoming
hopelessly  stuck  on  hard  problems  during  the  initial  phases  of  learning. 
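
The trial cutoff can be sketched as a loop that counts only overt actions against the limit. All names here are hypothetical, the toy task at the bottom is invented for the example, and the extra `max_total_steps` guard is an added assumption so that a policy issuing only perceptual actions cannot loop forever.

```python
def run_trial(state, choose_action, apply_action, is_solved, is_overt,
              max_overt_actions, max_total_steps=10_000):
    """Run one trial; return (steps_taken, solved)."""
    steps = overt = 0
    while overt < max_overt_actions and steps < max_total_steps:
        action = choose_action(state)
        state = apply_action(state, action)
        steps += 1
        if is_overt(action):
            overt += 1          # only overt actions count against the limit
        if is_solved(state):
            return steps, True
    return steps, False         # give up and move on to the next trial


# Toy task: state = (blocks still covering the target, has the agent looked?).
def choose_action(state):
    return "unstack" if state[1] else "look"


def apply_action(state, action):
    covering, _ = state
    return (covering, True) if action == "look" else (covering - 1, False)


def is_solved(state):
    return state[0] == 0


def is_overt(action):
    return action == "unstack"


print(run_trial((3, False), choose_action, apply_action, is_solved, is_overt, 5))  # (6, True)
print(run_trial((3, False), choose_action, apply_action, is_solved, is_overt, 2))  # (4, False)
```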


5.1.2  Results 

Because  problem  instances  are  selected  at  random,  temporally  adjacent  trials  may 
vary  widely  in  their  difficulty.  This  results  in  solution  time  traces  for  single  runs 
that  appear  very  noisy.  To  smooth  out  the  performance  curves  and  make  them 
easier  to  interpret,  the  solution  time  traces  for  multiple  runs  are  averaged  together 
and  plotted. 
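
The smoothing step is a pointwise mean across runs. A minimal sketch, assuming each run is stored as a list of steps-per-trial with the same number of trials (the function name is illustrative):

```python
def average_traces(traces):
    """Average solution-time traces trial by trial across runs."""
    if not traces:
        raise ValueError("need at least one run")
    n_trials = len(traces[0])
    if any(len(run) != n_trials for run in traces):
        raise ValueError("all runs must contain the same number of trials")
    n_runs = len(traces)
    return [sum(run[i] for run in traces) / n_runs for i in range(n_trials)]


# Two short runs of three trials each.
print(average_traces([[10, 4, 8], [20, 6, 8]]))   # [15.0, 5.0, 8.0]
```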


Figure 5.1: The number of steps taken per trial by Meliora-I versus trial number.
Each point is averaged over 200 runs. Also shown are results for an optimal
controller and a random controller. Meliora-I performs only slightly better than
random and generally fails to reliably solve all but the easiest instances of the
task.


Performance results for Meliora-I are shown in Figure 5.1. The figure shows
the average, over 200 runs, of the number of steps taken per trial for a sequence
of 1000 problem instances. These results are for  = 30, α = 0.5, γ = 0.9,
and p = 0.9.¹ Also shown are plots for an agent acting randomly and an agent
following the near-optimal policy shown in Figure 4.3. The plots show that
1-step Q-learning fails to learn the optimal policy (or a policy anywhere near it).
Meliora-I’s performance shows a slight initial improvement, but it quickly levels
out at a performance that is only slightly better than random. By keeping track
of the problem instances Meliora-I reliably learns to solve, we noticed that it
manages only to learn the trivial problems in which the green block is initially
clear. The learning of these instances accounts for the system’s initial performance
improvement. For all other instances, Meliora-I failed to learn a reasonable control
strategy. In particular, it never reliably learned to uncover the green block.


¹ Other experiments showed the performance to be insensitive to variations in the parameter
settings.

5.2  Definitions and Nomenclature


Before we get into the details of why Meliora-I cannot solve the GB-task, it is
useful  to  develop  further  our  language  for  describing  properties  and  relationships 
between  internal  and  external  decision  problems. 

To begin, it is convenient sometimes to use a single variable to denote a
state-action pair. Thus, the term decision will be used as a synonym for “state-action
pair,” and the variable d will be used to denote the state-action pair (s, a). Using
this terminology, we say the action-value function is defined over the set of possible
“decisions,” D = S x A.

Next, it is useful to define three sets that help to characterize the relationships
between states and decisions in the external and internal decision problems. Given
a formal description of an internal decision problem (S_I, A_I, T_I, R_I) stated in terms
of an external model (S_E, A_E, T_E, R_E) and a sensory-motor model (P, I, M, C),
define SRep(s') to be the states in the external model that for one configuration or
other of the sensory-motor system map into the internal state s'. That is, SRep(s')
denotes the class of external states represented by s'. Formally, s ∈ SRep(s') if
and only if there exists a sensory-motor configuration c ∈ C such that P(s, c) = s'.
If s ∈ SRep(s') then we say that s' represents s and that s is represented by s'.
Note that in general an internal state may represent many external states and an
external state may be represented by many internal states, depending upon the
configuration of the sensory-motor system.

Similarly, define DRep(d') to be the decisions in the external model that, for
one configuration or other of the sensory-motor system, map onto the internal
decision d' = (s', a'). Formally, d = (s, a) ∈ DRep(d') if and only if there exists
a sensory-motor configuration c ∈ C such that P(s, c) = s' and M(a', c) = a. If
d ∈ DRep(d') then we say that d' represents d and d is represented by d'. Again, in
an active sensory-motor system, these relationships are generally many-to-many.

Finally, define Cons(s, s') to be the sensory-motor configurations that map the
external state s onto the internal state s'. Formally, c ∈ Cons(s, s') if and only if
P(s, c) = s'.
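
For a finite model the three sets can be computed by direct enumeration. The sketch below does so for a hypothetical toy model (two-bit external states, a sensor that reads one bit, and a "flip" command whose external effect depends on the configuration). The function names `s_rep`, `d_rep`, and `cons` mirror SRep, DRep, and Cons but are otherwise illustrative.

```python
def s_rep(s_int, S_E, C, P):
    """External states that some configuration maps onto internal state s_int."""
    return {s for s in S_E for c in C if P(s, c) == s_int}


def d_rep(d_int, S_E, C, P, M):
    """External decisions represented by the internal decision d_int = (s', a')."""
    s_int, a_int = d_int
    return {(s, M(a_int, c)) for s in S_E for c in C if P(s, c) == s_int}


def cons(s, s_int, C, P):
    """Configurations under which external state s is perceived as s_int."""
    return {c for c in C if P(s, c) == s_int}


# --- Hypothetical toy model ---
S_E = range(4)          # two-bit external states
C = (0, 1)              # which bit the sensor currently reads


def P(s, c):
    return (s >> c) & 1


def M(a_int, c):
    return f"flip{c}" if a_int == "flip" else "noop"


print(sorted(s_rep(1, S_E, C, P)))             # [1, 2, 3]
print(sorted(s_rep(0, S_E, C, P)))             # [0, 1, 2]
print(sorted(cons(2, 1, C, P)))                # [1]
print(sorted(d_rep((1, "flip"), S_E, C, P, M)))
```

External states 1 and 2 appear in both SRep(0) and SRep(1), which is the many-to-many relationship noted above.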

Given  these  definitions,  it  is  now  possible  to  introduce  precisely  the  notion  of 
consistency.  In  particular,  an  internal  decision  is  said  to  be  consistent  if  every 
decision  it  represents  in  the  external  model  has  the  same  optimal  action-value. 
Formally, 

d' is consistent iff Q_E(d_1) = Q_E(d_2) for all d_1, d_2 ∈ DRep(d')    (5.1)

where Q_E(d) is the optimal action-value for the external decision d. An internal
decision  that  is  not  consistent  is  said  to  be  inconsistent. 

Similarly,  an  internal  state  is  defined  to  be  consistent  if  all  of  its  corresponding 
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decisions  (state-action  pairs)  are  consistent.  Formally, 

5'  is  consistent  iff  is  consistent.  (5.2) 

The concept of consistency allows us to sharpen our heretofore intuitive definitions
of passive abstraction and perceptual aliasing. In particular, passive abstraction
is the process of mapping, through perception, multiple external states onto a
single consistent internal state. Conversely, perceptual aliasing occurs when multiple
external states are mapped onto a single internal state in a way that results
in an inconsistency.
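
As a minimal sketch of the consistency test, assume the sets $DRep(d')$ and the external optimal action-values $Q_E$ are available as dictionaries. The names `decision_is_consistent` and `state_is_consistent`, the tolerance, and all numbers below are hypothetical and serve only to illustrate the definitions.

```python
def decision_is_consistent(ext_decisions, q_ext, tol=1e-9):
    """True if every represented external decision has the same action-value."""
    values = [q_ext[d] for d in ext_decisions]
    return not values or max(values) - min(values) <= tol

def state_is_consistent(s_int, internal_actions, d_rep, q_ext, tol=1e-9):
    """True if every internal decision (s_int, a') is consistent."""
    return all(decision_is_consistent(d_rep[(s_int, a)], q_ext, tol)
               for a in internal_actions)

# Hypothetical example: internal state "s'" represents external states x1 and x2.
Q_EXT = {("x1", "go"): 10.0, ("x2", "go"): 10.0,
         ("x1", "stay"): 4.0, ("x2", "stay"): 7.0}
D_REP = {("s'", "go"): {("x1", "go"), ("x2", "go")},
         ("s'", "stay"): {("x1", "stay"), ("x2", "stay")}}

print(decision_is_consistent(D_REP[("s'", "go")], Q_EXT))    # same values
print(decision_is_consistent(D_REP[("s'", "stay")], Q_EXT))  # differing values
print(state_is_consistent("s'", ["go", "stay"], D_REP, Q_EXT))
```

Here the decision to "go" is a case of passive abstraction, while the decision to "stay" is perceptually aliased, so the internal state as a whole is inconsistent.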


5.3  The  Effects  of  Perceptual  Aliasing 

Armed with these tools for describing various properties of internal decision problems,
the poor performance of 1-step Q-learning on the GB-task can be explained,
and it can be shown that most existing reinforcement learning algorithms cannot
be used to learn to control agents with active sensory-motor systems.

The first observation to make about the internal decision problems of agents
with active perception is that they are rife with perceptual aliasing and inconsistent
internal states. Perceptual aliasing goes hand in hand with active perception
since the whole point of active perception is to provide a flexible means for
selectively sensing and selectively ignoring various aspects of the world. This includes
the possibility of ignoring relevant information, which in turn leads to perceptual
aliasing. Therefore, the internal decision problem facing the controller of any
system that uses active perception will necessarily contain inconsistent internal
states. Indeed, since in most cases careful control of the sensory system is required
to register all of the relevant information, the vast majority of internal states are
likely to be inconsistent.

Another observation to make is that internal decision problems that contain
inconsistent internal states are necessarily non-Markov. This follows since, for
inconsistent states, knowledge of the current state is not sufficient to characterize
the dynamics of the process completely. Additional information, namely knowledge
of the hidden, external state, can be used to improve predictions about the
future of the process, a clear violation of the Markov property. This puts us on
shaky ground with respect to 1-step Q-learning since it has only been shown to
converge to the optimal policy for Markov decision problems. Indeed, as we shall
see, inconsistent states and non-Markov decision problems wreak havoc on 1-step
Q-learning.^

^ When I say "it has only been shown to converge for Markov decision problems", I do not
wish to minimize the significance of Watkins' convergence result for 1-step Q-learning. This
is an extremely general result and one of the few formal theorems in reinforcement learning.
Indeed, Markov models are the foundation of most work on sequential optimization problems.
Non-Markov models, because of their mathematical intractability, tend to be at the fringe.
Unfortunately, active perception invariably leads to non-Markov models. Therefore, we have no
choice but to try to deal with them.

The trouble with inconsistent internal states is that they prevent the embedded
decision system from accurately estimating and representing the true utility
of applying an internal action at a given point in time. Given an inconsistent
internal decision $d' = (s', a')$, the single action-value maintained by the Q-learner,
denoted $Q_I(d')$, cannot simultaneously represent the disparate action-values for
the external decisions in $DRep(d')$. This means that regardless of the value of
$Q_I(d')$ there will be points in time when the internal state is $s'$ and $Q_I(s', a')$ will
not accurately reflect the true utility of executing the action $a'$. At these crucial
instants in time, the internal action-value estimate and the optimal action-value for
the external decision it represents differ (or are inconsistent). These events are
called utility aberrations.

Utility aberrations are certain to occur in systems with inconsistent internal
states and can impair decision making both locally and globally.

A local impairment occurs when, because of an inability to accurately represent
the action-values of an inconsistent state, a non-optimal value is incorrectly
assigned to the policy value of the inconsistent state. A local impairment can
occur if either an optimal or a non-optimal decision is inconsistent. If the optimal
decision for a given state is inconsistent, and if the action-value estimate for that
decision dips below the action-value estimate for a non-optimal action because
of a utility aberration, then the agent will incorrectly prefer the non-optimal action
over the optimal one. Similarly, if a non-optimal decision is inconsistent and
the utility aberration is such that its action-value estimate is increased over the
action-value for the optimal decision, then the agent will incorrectly prefer the
non-optimal action.

Utility aberrations can also have global effects, leading to non-optimal behavior
in states that are otherwise consistent. Global impairments occur when inaccurate
utility estimates from inconsistent states are used to update the action-value estimates
for other (potentially consistent) states. For instance, in 1-step Q-learning,
the action-value estimate for the state-action pair executed at time $t$ is updated
according to the rule,

$$Q_I(s_t, a_t) \leftarrow (1 - \alpha)\, Q_I(s_t, a_t) + \alpha\, [r_t + \gamma U_I(s_{t+1})],$$

where $U_I(s_{t+1}) = \max_{a \in A_I} Q_I(s_{t+1}, a)$. Unfortunately, if $s_{t+1}$ is inconsistent then
$U_I(s_{t+1})$ may be inaccurate with respect to the current situation in the external
world (due to a utility aberration) and the 1-step estimator, $r_t + \gamma U_I(s_{t+1})$, may
be incorrect, thus introducing an error into $Q_I(s_t, a_t)$. This error may now lead
to non-optimal behavior in state $s_t$ and an inaccurate estimate of $U_I(s_t)$, which
in turn may infect other states, and so on.
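
The 1-step update rule discussed above can be sketched in a few lines. This is a minimal illustration under assumed names (`q_update`, a dictionary-backed table `Q`); it is not the implementation used in the experiments. It makes the propagation mechanism visible: whatever value is stored for the successor state, accurate or not, flows directly into the estimate being updated.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.8):
    """Apply one 1-step Q-learning update to Q[(s, a)] and return the new value."""
    # U_I(s_next): utility of the successor is its best action-value.
    u_next = max(Q[(s_next, b)] for b in actions)
    # Blend the old estimate with the 1-step estimator r + gamma * U_I(s_next).
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * (r + gamma * u_next)
    return Q[(s, a)]

# Hypothetical two-state example: the successor's stored utility is 10.
Q = defaultdict(float)
Q[("s1", "right")] = 10.0
print(q_update(Q, "s0", "right", 0.0, "s1", ["left", "right"], alpha=0.5))
```

If the stored utility of the successor is an aberration, the update above copies a discounted share of that error into the predecessor, which is the global effect described in the text.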


5.4  A  Simple  Example 

To illustrate the extent to which perceptual aliasing can interfere with Q-learning,
let us consider a simple example. Consider the task shown in Figure 5.2. In this
task, the external decision problem has a state space containing eight states,
$S_E = \{s_0, s_1, \ldots, s_6, g\}$; two actions, $A_E = \{a_l, a_r\}$; and a deterministic
transition function, shown in Figure 5.2a. The goal of the external task is to
enter the goal state $g$, whereupon the agent receives a fixed reward $R_E(g) = 5000$.
Non-goal states yield zero reward, $R_E(s_k) = 0$ for $k = 0$ to 6.

The optimal value function for the external task, denoted $V_E$, is an exponentially
decreasing function of the distance to the goal. That is, $V_E(s) = 5000\, \gamma^{d(s) - 1}$,
where $d(s)$ is the distance (in steps) from state $s$ to the goal.
The optimal policy, $\pi_E$, corresponds to choosing the action that minimizes the
distance to the goal. In this case, the optimal policy requires the agent to move
right ($a_r$) at every opportunity (i.e., for all $s \in S_E$, $\pi_E(s) = a_r$). Notice that
the optimal solution path for a given trial traces out a trajectory where $V_E(x_t)$
is monotonically increasing in time, and that the optimal policy corresponds to
performing a gradient ascent of $V_E$. This result is illustrated in Figure 5.3a, which
plots $V_E(x_t)$ versus time for a trial that begins in state $s_0$ at time $t = 0$ and follows
the optimal trajectory to $g$ at time $t = 7$. When applied directly to this problem,
1-step Q-learning can easily learn the optimal policy. However, let us introduce
an inconsistency and see what happens.
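
As a quick numerical check of the external value function, the sketch below evaluates it along the optimal trajectory. It assumes a chain in which $a_r$ moves one step toward the goal, so that $d(s_k) = 7 - k$ (consistent with a trial that starts in $s_0$ at $t = 0$ and reaches $g$ at $t = 7$), and it uses $\gamma = 0.8$ as in the figures. The function name `v_ext` is illustrative.

```python
GAMMA = 0.8
GOAL_REWARD = 5000.0

def v_ext(k, gamma=GAMMA):
    """Optimal external value of state s_k, assuming d(s_k) = 7 - k steps to g."""
    distance = 7 - k
    return GOAL_REWARD * gamma ** (distance - 1)

# Values along the optimal trajectory s_0, s_1, ..., s_6.
trajectory = [v_ext(k) for k in range(7)]
print([round(v) for v in trajectory])
```

Under these assumptions the value at $s_2$ comes out at about 2048, and the sequence increases at every step, which is the gradient-ascent property noted above.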

Consider the internal decision problem that results when the agent's sensory-motor
system implements a perceptual mapping that is fixed, one-to-one, and
onto except for states $s_2$ and $s_5$, which get mapped onto the same internal state,
$s'_{2,5}$. That is, let $S_I = \{s'_0, s'_1, s'_{2,5}, s'_3, s'_4, s'_6, g'\}$, where, except for $s'_{2,5}$,
$s'_i$ (and $g'$) represents world state $s_i$ (and $g$). Also let the motor mapping be such that
$A_I = \{a'_l, a'_r\}$, where $a'_l$ and $a'_r$ map to $a_l$ and $a_r$, respectively. The transition
diagram for this internal decision problem is shown in Figure 5.2b. Note that this
problem is non-Markov since the effects of actions are not independent of the past
but depend upon the hidden, unperceived external state. Also note that a fixed
optimal policy for this task is to always apply the action $a'_r$.

Figure 5.2: Transition diagrams for a simple decision task: a) the transition diagram
for the external decision problem, b) the transition diagram for the internal
(or perceived) decision problem when interpreted through a sensory-motor system
with perceptual aliasing.

Figure 5.3: Plots of utility versus time as the agent traverses from state $s_0$ at
$t = 0$ to $g$ at $t = 7$ (for $\gamma = 0.8$): a) the utility for the external decision problem,
$V_E$; b) the utility estimates for the internal decision problem, $U_I$, obtained by the
1-step Q-learning algorithm.

1-step Q-learning cannot learn the optimal policy for this task. In particular, if
the agent's policy is initialized to the optimal policy and the controller is fixed so
that the system follows the optimal policy with probability $p = 0.99$ and chooses
a random action otherwise, and if the system is run for many trials and allowed
to estimate the optimal value and action-value functions, then the following is observed.
First, since the value and action-value estimates ($U_I$ and $Q_I$, respectively)
are based on expected returns, for the state $s'_{2,5}$ they take on values somewhere
between the corresponding values for $s_2$ and $s_5$ in the external decision problem.
That is,

$$V_E(s_2) < U_I(s'_{2,5}) < V_E(s_5), \qquad (5.3)$$

$$Q_E(s_2, a_r) < Q_I(s'_{2,5}, a'_r) < Q_E(s_5, a_r), \qquad (5.4)$$

$$Q_E(s_2, a_l) < Q_I(s'_{2,5}, a'_l) < Q_E(s_5, a_l). \qquad (5.5)$$
Actually, the estimated action-value function does not even converge to the true
sampled average of the returns observed. This follows since, to update its action-value
function, the agent uses a 1-step estimator, which enforces only local constraints
on the values estimated. If the learning rate is gradually decreased with
time, the action-value function estimated by the agent converges to the values
that satisfy the following local relationships:

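\begin{align}
Q_I(s'_6, a'_r) &= R(g') + \gamma \cdot 0 = 5000 \tag{5.6} \\
Q_I(s'_{2,5}, a'_r) &= f_1 [0 + \gamma U_I(s'_3)] + f_2 [0 + \gamma U_I(s'_6)] \tag{5.7} \\
Q_I(s'_4, a'_r) &= 0 + \gamma U_I(s'_{2,5}) \tag{5.8} \\
Q_I(s'_3, a'_r) &= 0 + \gamma U_I(s'_4) \tag{5.9} \\
Q_I(s'_1, a'_r) &= 0 + \gamma U_I(s'_{2,5}) \tag{5.10} \\
Q_I(s'_0, a'_r) &= 0 + \gamma U_I(s'_1) \tag{5.11} \\
Q_I(s'_6, a'_l) &= 0 + \gamma U_I(s'_{2,5}) \tag{5.12} \\
Q_I(s'_{2,5}, a'_l) &= f'_1 [0 + \gamma U_I(s'_1)] + f'_2 [0 + \gamma U_I(s'_4)] \tag{5.13} \\
Q_I(s'_4, a'_l) &= 0 + \gamma U_I(s'_3) \tag{5.14} \\
Q_I(s'_3, a'_l) &= 0 + \gamma U_I(s'_{2,5}) \tag{5.15} \\
Q_I(s'_1, a'_l) &= 0 + \gamma U_I(s'_0) \tag{5.16} \\
Q_I(s'_0, a'_l) &= 0 + \gamma U_I(s'_0) \tag{5.17}
\end{align}

where $U_I(x) = \max_{a \in \{a'_l, a'_r\}} Q_I(x, a)$.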
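
The local relationships above form a fixed-point system, and one way to see which values they imply is to iterate them until they stop changing. The sketch below does this for $\gamma = 0.8$ with both branch fractions set to one half. The state labels, the function names, and the assumption that $a'_l$ applied in $s'_0$ leaves the agent in $s'_0$ are illustrative choices for this sketch, not a description of how the original experiments were run.

```python
GAMMA = 0.8
STATES = ["s0", "s1", "s25", "s3", "s4", "s6"]  # "s25" is the aliased state

def utility(Q, s):
    """U_I(s): the larger of the two action-values in internal state s."""
    return max(Q[(s, "l")], Q[(s, "r")])

def solve_local_constraints(f=0.5, gamma=GAMMA, sweeps=500):
    """Iterate the local 1-step constraints until they reach their fixed point."""
    Q = {(s, a): 0.0 for s in STATES for a in ("l", "r")}
    for _ in range(sweeps):
        U = {s: utility(Q, s) for s in STATES}
        Q = {
            ("s6", "r"): 5000.0,                                   # enters the goal
            ("s25", "r"): f * gamma * U["s3"] + (1 - f) * gamma * U["s6"],
            ("s4", "r"): gamma * U["s25"],
            ("s3", "r"): gamma * U["s4"],
            ("s1", "r"): gamma * U["s25"],
            ("s0", "r"): gamma * U["s1"],
            ("s6", "l"): gamma * U["s25"],
            ("s25", "l"): f * gamma * U["s1"] + (1 - f) * gamma * U["s4"],
            ("s4", "l"): gamma * U["s3"],
            ("s3", "l"): gamma * U["s25"],
            ("s1", "l"): gamma * U["s0"],
            ("s0", "l"): gamma * U["s0"],                          # assumed: stays put
        }
    return Q

Q = solve_local_constraints()
for s in STATES:
    print(s, round(utility(Q, s), 1), round(Q[(s, "l")], 1), round(Q[(s, "r")], 1))
```

Under these assumptions the fixed point agrees with the estimated rows of Table 5.1 to within rounding, and it already shows the left action being valued above the right action in $s'_3$.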

In Equation 5.7, $f_1$ and $f_2$ are the fractions of times the application of $a'_r$ in
state $s'_{2,5}$ results in the next state being $s'_3$ and $s'_6$, respectively. Similarly, in
Equation 5.13, $f'_1$ and $f'_2$ are the fractions of times the application of $a'_l$ in state
$s'_{2,5}$ results in states $s'_1$ and $s'_4$, respectively. If trials always begin in state $s_0$, then
$f_1 = f_2 = f'_1 = f'_2 = 50\%$ and the values for the utility and action-value functions
will converge on the values shown in Table 5.1. Also shown in the table are the
sampled utility and action-values ($V_S$ and $Q_S$, respectively), obtained by actually
measuring and averaging the returns received over many trials (instead of using a
1-step estimator). Notice that the sampled averages match the optimal values for
the external decision task except for states $s_2$ and $s_5$, where utility aberrations
occur. In this case,

$$V_S(s'_{2,5}) = \tfrac{1}{2} V_E(s_2) + \tfrac{1}{2} V_E(s_5) \qquad (5.18)$$

$$Q_S(s'_{2,5}, a'_l) = \tfrac{1}{2} Q_E(s_2, a_l) + \tfrac{1}{2} Q_E(s_5, a_l) \qquad (5.19)$$

$$Q_S(s'_{2,5}, a'_r) = \tfrac{1}{2} Q_E(s_2, a_r) + \tfrac{1}{2} Q_E(s_5, a_r). \qquad (5.20)$$

                  s'_0    s'_1    s'_3    s'_4    s'_{2,5}   s'_6
U_I(s)            1882    2352    2352    2352    2941       5000
Q_I(s, a'_l)      1506    1506    2352    1882    1882       2352
Q_I(s, a'_r)      1882    2352    1882    2352    2941       5000
V_S(s)            1310    1638    2560    3200    3024       5000
Q_S(s, a'_l)      1048    1048    1310    1638    1935       3200
Q_S(s, a'_r)      1310    1638    2560    3200    3024       5000

Table 5.1: The utility and action-value functions estimated by the 1-step Q-learning
algorithm and the true sampled utility and action-value functions. The
estimated functions do not match the true sampled values since they are obtained
by satisfying the local constraints imposed by the corrected 1-step estimator. The
estimated utility and action-values are denoted $U_I$ and $Q_I$, respectively, and the
sampled utility and action-values are denoted $V_S$ and $Q_S$. The values shown are
for $\gamma = 0.8$.

Also notice that the utility and action-values estimated by 1-step Q-learning,
except for state $s_6$, do not match either the external or the sampled utility and
action-values. This discrepancy arises because estimates for all the states up to
$s_5$ (i.e., $s'_0$, $s'_1$, $s'_{2,5}$, $s'_3$, $s'_4$) in the internal task are either directly or indirectly
dependent upon the utility estimate for $s_5$. However, since $s_2$ and $s_5$ are indistinguishable,
their internal action-value estimates are constrained to be the same.
Thus, a utility aberration occurs whenever the world is in $s_5$. This inaccurate
utility estimate in turn gets propagated back to all the states that precede it.
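
The sampled values for the aliased state can be checked with simple arithmetic. The sketch below assumes the chain structure used in this example ($a_r$ moves one step toward the goal, $a_l$ moves one step away and leaves $s_0$ unchanged), $\gamma = 0.8$, and equal visit frequencies for $s_2$ and $s_5$. The helper names `v_ext` and `q_ext` are hypothetical.

```python
GAMMA = 0.8

def v_ext(k):
    """Optimal external value of s_k on the chain (s_6 is one step from g)."""
    return 5000.0 * GAMMA ** (6 - k)

def q_ext(k, action):
    """Optimal external action-value of (s_k, action), action in {'l', 'r'}."""
    if action == "r":
        # Moving right from s_6 enters the goal and collects the reward.
        return 5000.0 if k == 6 else GAMMA * v_ext(k + 1)
    # Assumed: moving left steps back one state, and s_0 stays at s_0.
    return GAMMA * v_ext(max(k - 1, 0))

# The aliased internal state averages over the two world states it represents.
v_sampled = 0.5 * v_ext(2) + 0.5 * v_ext(5)
q_sampled_left = 0.5 * q_ext(2, "l") + 0.5 * q_ext(5, "l")
q_sampled_right = 0.5 * q_ext(2, "r") + 0.5 * q_ext(5, "r")
print(round(v_sampled), round(q_sampled_left), round(q_sampled_right))
```

Under these assumptions the averages come out near 3024, 1935, and 3024, in line with the sampled entries for the aliased state in Table 5.1, and none of them equals the true value at either $s_2$ or $s_5$.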

The next important observation to make is that the utility function (either
learned or measured) for the internal decision problem is no longer monotonically
increasing as the system traverses the optimal solution trajectory. This anomaly
is shown graphically in Figure 5.3b, which plots $U_I(x_t)$ as a function of time as the
system follows the optimal trajectory from $s'_0$ to $g'$. The plot shows that a utility
aberration occurs at $t = 2$ when the system first encounters $s'_{2,5}$. In reality, the
world is in state $s_2$ and the true expected return is $V_E(s_2) = 2048$ (for $\gamma = 0.8$).
However, because $s_2$ and $s_5$ are indistinguishable in the internal representation,
the internal decision system overestimates the expected return at $t = 2$. Similarly,
another utility aberration occurs the second time $s'_{2,5}$ is encountered, at $t = 5$ when
the world is in state $s_5$. In this case, $U_I(s'_{2,5})$ underestimates the expected return.
If we relax our hold on the decision policy and allow the system to adapt,
we find that the optimal policy is unstable! Not only is the system unable to
find the optimal policy, it actually moves away from it. In general, the system
will oscillate among policies, never finding a stable one. The instability can be
understood by considering the effect of utility aberrations on the policy. Recall
that in Q-learning the system locally adjusts its policy in order to maximize the
expected return. Thus, after running the agent with a fixed policy for many
trials and then releasing it, the policy value for state $s'_3$ will be changed so that
the system tends to take actions that move it back to $s'_{2,5}$ instead of forward to
$s'_4$ (since $Q_I(s'_3, a'_l) > Q_I(s'_3, a'_r)$). The large utility value for state $s'_{2,5}$ acts as an
attractor for nearby states, such as $s'_3$, and causes them to change their local policy
away from optimal. An intuitive way to understand the problem is to consider a
local homunculus that sits at $s'_3$ and can see the utilities of its neighbors. From
his point of view, $s'_{2,5}$ looks more desirable than $s'_4$ since once the system is in $s'_{2,5}$
it can execute $a'_r$, which often leads to $s'_6$ (one step from the goal). On the other
hand, choosing the action which leads to $s'_4$ leaves the system still three steps from
the goal. From the homunculus' point of view, going to $s'_{2,5}$ is on average better
than going to $s'_4$. What the homunculus cannot perceive (because of perceptual
aliasing) is that going from $s_3$ directly to $s'_{2,5}$ always returns the real external world
to state $s_2$, which cannot reach $s_6$ directly. The problem is that the homunculus
cannot distinguish between $s_2$ and $s_5$, as they are both represented by $s'_{2,5}$, and it
erroneously makes the Markov assumption: that the effects of actions depend
only upon the current perceived state.

The utility aberrations are also unstable since they are based on a running
average of the expected returns. If, because of policy changes, $s_5$ is rarely visited,
the aberration at $s_2$ will disappear. Unfortunately, as soon as the policy is changed
so that $s_5$ begins to be encountered more frequently, the aberration reappears, and
so on. Thus, the system oscillates from policy to policy, unable to converge on a
stable one.


5.5  Looking  Back  at  Meliora-I 

It is now easy to see why Meliora-I cannot learn to solve the GB-task. In Meliora-I,
the internal state space is rife with inconsistencies. Indeed, the only time the
internal state is consistent is when one of the markers (either the action-frame or the
attention-frame) is on the green block. All other configurations of the sensory-motor
system generate inconsistent internal states.

An example of an inconsistent internal state is shown in Figure 5.4. This figure
shows four external states, each with different optimal utilities, that under appropriate
conditions generate the same internal state (shown below). The utility and
action-values for this internal state tend to reflect an average of the utilities and
action-values for the external states it represents. When encountered in operation,
it is unlikely that this state's utility estimate will match the actual utility of
the current external state; sometimes it will overestimate and sometimes it will
underestimate.

Figure 5.4: Four different external states, each with different utilities, can generate
the same internal state. In this case, the distance to the goal state for each of
these states is 15, 12, 7, and 2 steps, respectively. The utility estimate for the
internal state tends to reflect an average of the utilities of the external states it
represents. If we just consider these four external states (indeed the equivalence
class for this internal state contains many more external states), then the utility
of the internal state would correspond roughly to a state approximately 9 steps
away from the goal. When representing external states like those on the top,
the internal state tends to overestimate the utility of the situation (aberrational
maximum) and when representing situations like those on the bottom, the internal
state tends to underestimate the utility of the situation (aberrational minimum).

To see how this inconsistent state interferes with decision making, consider the
situation shown at the top in Figure 5.5. In this case, the green block is covered
by three blocks, the action-frame marks the top block, and the attention-frame
marks the green block. The internal state generated by this arrangement is consistent
and shown directly below the pile. Consider the result of performing two
different actions. First, suppose the agent performs the grasp action. This action
results in the external and internal states shown on the left. The internal state
for this situation is also consistent. Now suppose that instead of performing the
grasp action the agent performs the move-attention-frame-to-table action. This
action results in the external and internal states shown on the right in Figure 5.5.
In this case, the internal state is the inconsistent state from Figure 5.4. Even
though the utility of the external state on the left in Figure 5.5 is greater than the
external state on the right (a minimum distance of 16 versus 14 steps), Meliora-I
prefers move-attention-frame-to-table over grasp-object-at-action-frame. This follows
since the utility estimate for the internal state on the right, due to averaging,
tends to be higher than the utility of the consistent internal state on the left.
The situation on the left corresponds to a state of knowledge in which the agent's
internal state encodes the truth about the utility of the world, whereas the situation
on the right corresponds to a state of ignorance. When viewed by an outside
observer, Meliora-I appears to take an "ignorance is bliss" approach to control.
That is, when it encounters a situation in the external world where it discovers
the utility is low (i.e., a great deal of work must be done to obtain some reward),
instead of gritting its teeth and getting on with it, it prefers to ignore the problem
by focusing its attention elsewhere (i.e., since states of ignorance appear to be
better off on average than states that are known to be far from the goal).


5.6  The  Long  Arm  of  Perceptual  Aliasing 


The difficulties caused by perceptual aliasing and inconsistent internal states are
not unique to 1-step Q-learning. Indeed, any reinforcement learning algorithm
that uses any form of truncated corrected return is subject to the detrimental
effects of perceptual aliasing. This includes the whole family of Q-learning
algorithms, algorithms based on temporal difference methods, and the bucket
brigade algorithm commonly used in classifier systems.

Figure 5.5: Three pairs of external and internal states from the GB-task. External
states are depicted as piles, and internal states are depicted as bit vectors. The
+ indicates the position of the action-frame; the * indicates the attention-frame.
At the top is a situation in which the attention-frame is bound to the green block
and the action-frame is bound to the top block in the stack; the corresponding
internal state is consistent. On the left is the situation that results from executing
the optimal action, grasp-at-action-frame, in the top state; this internal state is
also consistent. On the right is the situation that results from executing the non-optimal
action move-attention-frame-to-table; this internal state is inconsistent,
and, due to utility aberrations, tends to have a higher utility estimate than the
internal state on the left.

Also, as previously mentioned, perceptual aliasing always accompanies adaptive
systems that use active perception since in general a flexible sensory-motor
system can always be configured to ignore relevant information. Indeed, inconsistent
states are likely to be pervasive since in many cases only careful configuration
of the sensory-motor system will lead to consistent internal states. For instance,
in the GB-task, consistent internal states are achieved only when the green block
is marked. All other configurations lead to inconsistencies.

Finally, the negative effects of perceptual aliasing need not arise only in systems
with active perception. In general, perceptual aliasing accompanies all abstraction
or generalization mechanisms, that is, any time it is possible to ignore information
that is relevant to decision making and utility estimation. For instance, in
an example similar to the one described in Figure 5.2, Grefenstette [Grefenstette,
1988] has shown how strength averaging in the rules of a classifier system prevents
the system from learning an optimal control strategy. In this case, rules that match
multiple world states (allowed to improve generalization) exhibit perceptual aliasing
and, as a result, are vulnerable to inconsistencies and inaccurate utility estimates.
Other, similar examples of perceptual aliasing have been described recently
by Chapman and Kaelbling [Chapman and Kaelbling, 1991] and Tan [Tan, 1991a;
Tan, 1991b].


6  Learning Consistent Representations


This chapter describes Meliora-II, a program that successfully learns the GB-task
[Whitehead and Ballard, 1990; Whitehead and Ballard, 1991a]. The basic idea
underlying Meliora-II is to learn and use an internal representation that is complete
and consistent. Instead of freely mixing perceptual and overt actions in a
monolithic controller that permits inconsistent states in the internal representation,
Meliora-II divides control into two distinct phases: state identification and
overt control. During state identification, a consistent internal state is generated
by executing perceptual actions in order to configure the sensory-motor system
properly. This consistent internal state is then made available to an overt controller,
which generates the next overt action. Both the state identification and
overt control processes are adaptive. The state identification module is adapted
when it is found to generate internal states that are inconsistent. The overt control
module is adapted based on rewards gained through interactions with the
world.

In Meliora-II overt control is achieved by using a slightly modified version
of 1-step Q-learning, which aims to learn the internal equivalent to the external
optimal policy. State identification is achieved by learning a perceptual policy
that maps input vectors into perceptual actions. To generate an internal state, a
sequence of perceptual actions is performed and a set of candidate internal states
(state vectors) is generated. One of these internal states is then chosen to represent
the current external world state. The detection of inconsistent internal
states, used to update the perceptual policy, is accomplished by monitoring the
sign of the estimation error in the 1-step Q-learning rule. These specific techniques
are adequate for Meliora-II and the GB-task; however, other more general
techniques exist as well. In general, Meliora-II represents a specific example of a
general approach to the adaptive control of perception and action that we call the
Consistent Representation (or CR) Method. Algorithms and architectures based
on this method all share the same basic idea, which is to break control into state
identification and overt control and to generate an internal representation that is
consistent at each point in time. After describing the specific algorithm used by
Meliora-II and some experimental results, the discussion turns to the CR-method
in general. This discussion includes a description of the basic architecture of systems
that use the CR-method; a review of Meliora-II in terms of this architecture;
a discussion of some alternatives to the algorithms used in Meliora-II; and a brief
review of two other systems that also employ the CR-method. The chapter concludes
with a discussion of the limitations of Meliora-II and of the CR-method in
general.


6.1  Meliora-II 

Meliora-II  represents  our  solution  to  the  problem  of  perceptual  aliasing.  That 
is,  Meliora-II’s  decision  system  is  designed  specifically  to  be  embedded  within  an 
agent  with  an  active  sensory-motor  system  and  to  control  perception  actively  to 
overcome  the  negative  effects  of  perceptual  aliasing. 

The  decision  system  used  in  Meliora-II  is  based  on  three  tenets: 

1.  In  active  perception  a  world  state  can  be  represented  by  multiple  internal 
states,  one  of  which  is  usually  consistent.  That  is,  if  the  agent  looks  around 
enough  it  will  eventually  attend  to  those  objects  that  are  relevant  to  the 
task,  and  the  internal  state  associated  with  that  sensory  configuration  will 
be  consistent.  Our  algorithm  depends  on  the  existence  of  one  consistent 
internal  stale  for  each  world  state. 

2.  Inconsistent  states  disrupt  the  decision  system’s  ability  to  learn  by  promising 
inaccurate  estimates  of  the  expected  return.  Detecting  inconsistent  states 
and  eliminating  their  participation  in  decision  making  and  action-value  es¬ 
timation  minimizes  their  negative  effects. 

3.  If  the  world  is  deterministic,  then  inconsistent  states  will  (because  of  averag¬ 
ing)  periodically  overestimate  the  utility  of  the  actual  world  state,  whereas 
the  incidence  of  overestimation  in  consistent  states  can  be  made  to  diminish 
with  time.  Therefore,  inconsistent  states  can  be  detected  by  monitoring  the 
sign  of  the  estimation  error  in  the  1-step  Q-learning  rule,  Equation  3.40. 
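
The third tenet can be illustrated with a small numerical sketch. It is not part of Meliora-II: the function name `error_signs`, the learning rate, and the returns 1.0 and 0.5 are illustrative assumptions. A consistent state always sees the same target, while a hypothetical aliased state alternately stands for two world states with different returns, so its averaged estimate sits between them and the error turns negative whenever the lower-valued world state is the actual one.

```python
def error_signs(targets, alpha=0.5, q0=0.0):
    """Apply Q <- Q + alpha * (target - Q) and record whether each error is negative.

    A negative error means the current estimate overestimates the observed target.
    """
    q = q0
    negative = []
    for target in targets:
        error = target - q
        negative.append(error < 0)
        q += alpha * error
    return negative


# Consistent state: a deterministic world always yields the same return (assumed 1.0).
consistent = error_signs([1.0] * 20)

# Aliased state: the same internal state alternates between two world states
# whose returns are assumed to be 1.0 and 0.5.
aliased = error_signs([1.0, 0.5] * 10)

print("consistent state ever overestimates:", any(consistent))
print("aliased state ever overestimates:   ", any(aliased))
```

Starting from zero with positive returns, the consistent estimate approaches its target from below, so its error never goes negative. The aliased estimate overestimates periodically, which is the signal the detection rule relies on.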


6.1.1  The  Overt  Cycle 

The  decision  procedure  used  by  Meliora-II  is  called  the  Lion  algorithm  and  is 
outlined  in  Figure  6.1.  The  main  loop  in  the  decision  procedure  is  the  overt  cycle, 
which  concerns  itself  with  choosing  overt  actions  in  an  attempt  to  maximize  the 
expected  future  return.  Embedded  within  the  overt  cycle  is  a  perceptual  cycle 
(the  identification  stage).  After  each  overt  action,  the  system  executes  a  sequence 

Overt Cycle:

1) Execute Perceptual Cycle and generate $S_t$, a set of internal
   representations for the current world state.

2) Choose an internal state, the lion, to represent the current world state by selecting
   the state with the largest overt utility estimate:

   $\mathrm{lion} = \arg\max_{s \in S_t} [V_I(s)]$.

3) Estimate the utility of the current world state, $x_t$: $V_E(x_t) \leftarrow V_I(\mathrm{lion})$.

4) Execute Update-Overt-Q-Estimates based on $V_E(x_t)$, $r_{t-1}$, $\mathrm{oact}_{t-1}$, and $\mathrm{lion}_{t-1}$,
   where $r_t$ is the reward received at time $t$, $\mathrm{oact}_{t-1}$ is the last overt action executed,
   and $\mathrm{lion}_{t-1}$ is the internal state selected to represent the previous world state.

5) Choose the next overt action to execute:

   With probability $p$ follow policy $f_o(\mathrm{lion})$:

   $\mathrm{oact} \leftarrow \arg\max_{a \in A_o} [Q_I(\mathrm{lion}, a)]$.

   Otherwise choose randomly: $\mathrm{oact} \leftarrow \mathrm{Random}(A_o)$.

6) Execute oact to obtain a new world state and a reward $r_t$.

7) Go to 1).

Update-Overt-Q-Estimates:

1) Estimate the error in the lion’s action-value:

   $E_{lion} \leftarrow (r_{t-1} + \gamma V_E(x_t)) - Q_I(\mathrm{lion}_{t-1}, \mathrm{oact}_{t-1})$.

2) Update the action-value of the lion:

   If ($E_{lion} < 0$) then the lion is suspected of being inconsistent, so suppress it:

   $Q_I(\mathrm{lion}_{t-1}, \mathrm{oact}_{t-1}) \leftarrow 0.0$

   Else update it using the standard 1-step Q-learning rule:

   $Q_I(\mathrm{lion}_{t-1}, \mathrm{oact}_{t-1}) \leftarrow Q_I(\mathrm{lion}_{t-1}, \mathrm{oact}_{t-1}) + \alpha E_{lion}$.

3) Update non-lion internal states:

   For each $s \in S_{t-1}$ and $s \neq \mathrm{lion}_{t-1}$ do:

   Let $E_s = r_{t-1} + \gamma V_E(x_t) - Q_I(s, \mathrm{oact}_{t-1})$.

   If ($E_s < 0$) then $s$ is suspected of being inconsistent, so suppress it:

   $Q_I(s, \mathrm{oact}_{t-1}) \leftarrow 0.0$

   Else update it using the lion’s error:

   $Q_I(s, \mathrm{oact}_{t-1}) \leftarrow Q_I(s, \mathrm{oact}_{t-1}) + \alpha' E_{lion}$, where $\alpha' < \alpha$.

Perceptual Cycle:

1) Initialize $S_t$: $S_t \leftarrow \{s_c\}$, where $s_c$ is the current internal state.

2) Do $n$ times: (in our implementation $n = 4$)

   i) Select the next perceptual action:

      With probability $p'$ follow the perceptual policy: $\mathrm{pact} \leftarrow f_p(s_c)$,
      where $s_c$ is the current input vector and $f_p(s_c) = \arg\max_{a \in A_p} [Q_I(s_c, a)]$.

      Otherwise choose randomly: $\mathrm{pact} \leftarrow \mathrm{Random}(A_p)$.

   ii) Execute pact to obtain a new internal state $s'$.

   iii) Update the action-value estimate for the perceptual decision $(s_c, \mathrm{pact})$:

      $Q_I(s_c, \mathrm{pact}) \leftarrow Q_I(s_c, \mathrm{pact}) + \alpha (V_I(s') - Q_I(s_c, \mathrm{pact}))$.

   iv) Add $s'$ to $S_t$: $S_t = S_t \cup \{s'\}$.

   v) Update $s_c$: $s_c \leftarrow s'$.

3) Return $S_t$.


Figure 6.1: An outline of the decision procedure implemented by Meliora-II. This
procedure is called the lion algorithm and is designed specifically to overcome the
difficulties caused by perceptual aliasing.

of perceptual actions (the perceptual cycle) in an attempt to identify an internal
state that consistently represents the current external state. The input vectors
observed during the perceptual cycle define a set of candidate internal states,
$S_t$. Each of these internal states corresponds to a different view (representation)
of the current external world. The state chosen to represent the current situation
is called the lion¹ and is simply the internal state in $S_t$ with the largest utility
estimate. That is,

    $\mathrm{lion} = \arg\max_{s \in S_t} [V_I(s)]$    (6.1)

where $V_I(s)$, the overt utility estimate, is the maximum overt action-value for
state $s$. That is,

    $V_I(s) = \max_{a \in A_o} [Q_I(s, a)]$.    (6.2)

Once the lion is selected, the utility of the current external state, $V_E(x_t)$, is
estimated using the overt utility estimate for the lion, $V_I(\mathrm{lion})$. As will be described
below,  the  algorithm  for  adjusting  the  action-value  estimates  for  overt  actions 
severely  lowers  the  values  for  inconsistent  decisions.  Consequently,  the  lion  tends 
to  be  consistent  and  overt  control  tends  to  be  based  on  action-value  estimates 
that  accurately  reflect  the  true  state  of  the  external  world. 
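
Lion selection, as defined in Equations 6.1 and 6.2, can be sketched in a few lines. This is an illustrative sketch rather than the original implementation: the dictionary-based table `Q`, the default value of 0.0 for unseen decisions, and the state and action names are assumptions made for the example.

```python
def overt_utility(Q, state, overt_actions):
    """V_I(s): the maximum overt action-value for an internal state (Equation 6.2)."""
    return max(Q.get((state, a), 0.0) for a in overt_actions)


def choose_lion(Q, candidates, overt_actions):
    """Pick the candidate internal state with the largest overt utility (Equation 6.1)."""
    return max(candidates, key=lambda s: overt_utility(Q, s, overt_actions))


# Hypothetical candidate views of one world state and a toy action-value table.
overt_actions = ["pick", "place"]
Q = {
    ("view_a", "pick"): 0.2,
    ("view_b", "pick"): 0.9,
    ("view_b", "place"): 0.1,
}
candidates = ["view_a", "view_b", "view_c"]

lion = choose_lion(Q, candidates, overt_actions)
v_external = overt_utility(Q, lion, overt_actions)  # estimate of V_E(x_t)
print(lion, v_external)
```

Because suppressed decisions are held at zero, a state whose action-values have been reset rarely wins the comparison, which is how suppression steers selection toward consistent states.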

Once $V_E(x_t)$ has been estimated, action-value estimates for the previous overt
action are updated (as described below). The overt cycle then continues by
selecting an overt action to execute. With probability $p$, the system chooses the
action consistent with its overt policy $f_o$; the rest of the time it chooses an overt
action at random. When following policy, the action executed is simply the overt
action that maximizes the action-value function for the lion:

    $\mathrm{oact} = \arg\max_{a \in A_o} [Q_I(\mathrm{lion}, a)]$.    (6.3)

Once  an  overt  action  is  chosen,  it  is  performed  and  the  overt  cycle  begins  anew. 
Figure  6.2  shows  a  cartoon  of  the  decision  system  in  action.  The  large  nodes  rep¬ 
resent  external  world  states,  and  the  arcs  between  them  overt  actions.  Embedded 
within  each  large  node  is  a  subgraph  representing  the  perceptual  cycle.  The  nodes 
in  this  graph  correspond  to  internal  states  seen  by  the  embedded  decision  system, 
and  the  arcs  between  them  correspond  to  perceptual  actions. 

6.1.2  Learning  a  New  Action-Value  Function

In Dynamic Programming and Q-learning the action-value of a decision, for a
given policy, is defined as the return the system expects to receive given that it
performs that decision and follows the policy thereafter (cf. Equation 3.13). However,
for inconsistent internal decisions this definition leads to inaccurate action-values
(utility aberrations). Meliora-II uses a modified learning algorithm that
is based on Q-learning but incorporates a competitive component. This component
tends to suppress the action-values of inconsistent decisions while allowing
action-values for consistent decisions to take on their nominal values. This allows
consistent states to be selected during state identification and provides the
overt cycle with action-values that accurately reflect the expected returns of the
underlying external decision problem.

¹Since it takes the lion’s share of credit or blame for the agent’s performance.

The algorithm for updating $Q_I$ is based on the following ideas: 1) the lion
state chosen to represent the current world state should be consistent and its
action-values should take on their nominal values; 2) the other
internal states in $S_t$ that might otherwise be used to represent the current state do
not need to have their action-values updated as long as the action-values for the
lion are accurate; 3) any time a decision is inconsistent it should be detected and
its action-value should be suppressed. Ideally, the decision system should learn
a new action-value function in which the action-values of consistent decisions
take on their corresponding values for the external task and the action-values for
inconsistent decisions are zero:

    $Q_I^{new}(s, a) = \begin{cases} Q_E^*(s_e, a_e) & \text{if } (s, a) \text{ is consistent} \\ 0 & \text{otherwise} \end{cases}$    (6.4)

where $(s_e, a_e) \in DRep(s, a)$. Since rewards are always positive in the GB-task,
$Q_E^*$ is uniformly greater than zero. Thus, when using $Q_I^{new}$ to define its overt
policy Meliora-II would never base its decision on an inconsistent internal state
and would always choose the optimal overt action.²

In Meliora-II, inconsistent decisions are detected as follows. If at time $t$ the
action-value for any decision $d = (s, a)$, where $s \in S_t$, is greater than the estimated
return obtained after one step, $r_t + \gamma V_I(\mathrm{lion}_{t+1})$, then the decision is suspected of
being inconsistent and its action-value is suppressed (e.g., reset to zero). Actively
reducing the action-values of lions that are suspected of being inconsistent gives
other (possibly consistent) internal states an opportunity to become lions. If
the lion does not overestimate, its action-value is updated using the 1-step
Q-learning rule. To prevent inconsistent states from climbing back into contention
and competing for lionhood, the estimates for non-lion decisions in $S_t$ are updated
at a lower learning rate and only in proportion to the error in the lion’s
action-value. Also, any time a decision’s action-value overestimates the 1-step return,
it is suppressed. This algorithm works for external tasks that are deterministic
and have only positive rewards (e.g., the GB-task). For these cases, the property
that allows the algorithm to work is that inconsistent decisions will eventually
overestimate their action-values (due to utility aberrations). Thus, inconsistent
states will eventually be suppressed. On the other hand, it can be shown that a
consistent lion is stable if every external state between the lion and the goal (or
every state in the limit cycle of the optimal policy) also has consistent lions with
accurate action-value estimates.³ Thus, inconsistent decisions are unstable with
respect to lionhood while consistent decisions eventually become stable. The steps
for updating action-values are shown in Figure 6.1 under the
Update-Overt-Q-Estimates heading.

²Notice that this definition of optimal does not account for the cost of perceptual actions.
Here optimality is defined in terms of the external decision problem.
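
The updating rules described in this section can be sketched as a single function. This is a minimal sketch under stated assumptions, not the original implementation: the dictionary-based table `Q`, the parameter values for `gamma`, `alpha`, and `alpha_nonlion`, and all names are hypothetical. Following the text, any decision whose action-value overestimates the 1-step return is reset to zero; a lion that does not overestimate receives the standard 1-step update, and the remaining candidates move at a lower rate in proportion to the lion's error.

```python
def update_overt_q(Q, prev_candidates, prev_lion, prev_oact, prev_reward, v_now,
                   gamma=0.9, alpha=0.5, alpha_nonlion=0.1):
    """Apply the competitive update to the previous step's overt decisions.

    Q maps (internal_state, overt_action) to an action-value; missing keys count as 0.0.
    v_now is the utility estimate of the current world state (the new lion's V_I).
    """
    target = prev_reward + gamma * v_now           # estimated 1-step return
    lion_key = (prev_lion, prev_oact)
    lion_error = target - Q.get(lion_key, 0.0)

    if lion_error < 0:
        Q[lion_key] = 0.0                          # lion overestimated: suppress it
    else:
        Q[lion_key] = Q.get(lion_key, 0.0) + alpha * lion_error

    for s in prev_candidates:
        if s == prev_lion:
            continue
        key = (s, prev_oact)
        if target - Q.get(key, 0.0) < 0:
            Q[key] = 0.0                           # non-lion overestimated: suppress it
        else:
            # Non-lions track the lion's error at a lower learning rate.
            Q[key] = Q.get(key, 0.0) + alpha_nonlion * lion_error
    return lion_error


# Toy example: the previous lion overestimates and is suppressed.
Q = {("lion", "a"): 2.0, ("other", "a"): 0.2}
update_overt_q(Q, ["lion", "other"], "lion", "a", prev_reward=0.5, v_now=1.0)
print(Q)
```

Resetting to zero is a safe suppression value only because rewards in the GB-task are positive, so every consistent decision has a nominal value above zero.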

6.1.3  The  Perceptual  Subcycle 

The steps in the perceptual cycle are sketched in Figure 6.1 under the Perceptual
Cycle heading. The objective of the perceptual cycle is to accumulate a set
of internal representations of the external world, one of which is consistent. This
goal is achieved by executing a series of perceptual actions. In Meliora-II the
perceptual cycle executes a fixed number ($n = 4$) of perceptual actions. This number
has proven adequate for the GB-task, but it is easy to imagine variable-length
perceptual cycles in which the cycle either terminates as soon as a consistent internal
state is found or increases when inconsistent states are encountered. The algorithm
for selecting actions within the perceptual cycle is similar to the algorithm
for choosing overt actions in the overt cycle. With probability $p'$ (e.g., $p' = 0.9$),
the system follows its perceptual policy $f_p$; otherwise, it selects a perceptual action
at random. When following policy, the action selected is the perceptual action
with the maximal action-value for the currently perceived input vector:

    $f_p(s_p) = \arg\max_{a \in A_p} [Q_I(s_p, a)]$    (6.5)

where $s_p$ is the currently perceived input vector.

The rules for updating action-values for perceptual actions are shown in Figure 6.1
within the Perceptual Cycle procedure. These updating rules lead to
action-values that average the overt utilities of the internal states that result from
executing a perceptual action. Since consistent states tend to have higher overt
utilities than inconsistent states (whose action-values are suppressed), the effect
is to choose perceptual actions that lead to consistent internal states.
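
The perceptual cycle can be sketched as follows. This is an illustrative sketch, not the original code: the callback `execute`, which stands in for the sensory-motor system and returns the internal state seen after a perceptual action, is hypothetical, as are the dictionary-based table `Q` and the learning rate. Only the values $n = 4$ and $p' = 0.9$ come from the text.

```python
import random


def perceptual_cycle(Q, s_c, perceptual_actions, overt_actions, execute,
                     n=4, p_follow=0.9, alpha=0.5, rng=random):
    """Run n perceptual actions and return the set of internal states observed."""
    candidates = {s_c}
    for _ in range(n):
        if rng.random() < p_follow:
            # Follow the perceptual policy f_p (Equation 6.5).
            pact = max(perceptual_actions, key=lambda a: Q.get((s_c, a), 0.0))
        else:
            pact = rng.choice(perceptual_actions)

        s_next = execute(pact)  # hypothetical sensory-motor call

        # Move the perceptual action-value toward the overt utility of the new state.
        v_next = max(Q.get((s_next, a), 0.0) for a in overt_actions)
        key = (s_c, pact)
        Q[key] = Q.get(key, 0.0) + alpha * (v_next - Q.get(key, 0.0))

        candidates.add(s_next)
        s_c = s_next
    return candidates


# Toy usage: a single perceptual action that always reveals the same view.
Q = {("green_view", "pick"): 1.0}
views = perceptual_cycle(Q, "start", ["look"], ["pick"],
                         execute=lambda pact: "green_view", p_follow=1.0)
print(views, Q[("start", "look")])
```

With four perceptual actions the returned set holds at most five distinct internal states, namely the starting state plus one per action.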


6.2  Performance  Results 

A series of experiments was performed on the GB-task to evaluate Meliora-II’s performance
quantitatively. In each run, the robot was sequentially presented with 1000
instances of the task (i.e., 1000 trials). As in Chapter 5, each instance consists of a
randomly configured pile of 4 blocks, with the pile always containing exactly one

³See [Tan, 1991a] for a nice analysis of this overestimation technique.
Figure 6.3: A plot of the number of steps per trial as a function of the instances
seen by Meliora-II for a typical experimental run. High variation is due to the
wide variety of tasks being solved.

green block; and if in any trial the robot fails to solve the problem after $n_{quit} = 30$
overt  actions,  it  moves  on  to  the  next  trial. 

Performance results for a typical experimental run are shown in Figure 6.3.
The graph shows the number of overt actions taken by Meliora-II for each of the
1000 instances of the task it encounters during a typical run. Initially, Meliora-II
fails on almost every trial (i.e., it takes 30 steps and quits). It does, however,
manage to solve a few instances. These early successes are invariably easy problems,
requiring only one or two correct actions to solve. After about 100 trials,
Meliora begins to solve more and more instances, including more difficult problems.
Eventually, it learns to solve even the most difficult instances and rarely
fails (e.g., < 5% failure after 1000 trials).

Meliora’s  performance  on  a  given  trial  depends  strongly  on  the  difficulty  of 
the  trial  instance;  consequently,  the  curve  in  Figure  6.3  has  a  large  variance. 
A  clearer  picture  of  Meliora’s  performance  is  obtained  by  averaging  results  over 
multiple  experimental  runs.  Figure  6.4  plots  the  solution  time  per  trial  averaged 
over  200  runs.  Plots  for  the  optimal  number  of  steps  (average  of  200  runs)  and 
for  an  agent  behaving  randomly  are  also  shown.  The  figure  clearly  shows  that 
Meliora’s  initial  performance  is  poor  —  near  the  maximum  of  30  steps  per  trial 
—  but  improves  considerably  during  the  first  few  hundred  trials.  The  system’s 
performance  settles  at  just  under  12  steps  per  trial  (about  125%  optimal). 

Figure 6.4: A plot of the average number of steps per trial as a function of the
instances seen by Meliora-II. The average is taken over 200 runs and provides
a smoother picture of the system’s learning curve. The system’s steady state
performance is approximately 125% optimal. Also shown is the average number
of steps taken by an agent acting randomly.

The system’s performance fails to converge to optimal for two reasons. First,
with probability $1 - p$ ($p = 0.9$), the decision system chooses its overt action
randomly, reflecting a simplification in our decision algorithm that can be eliminated
by incorporating more complex procedures for controlling exploration [Barto et al.,
1990; Kaelbling, 1990]. Second, the decision system is not guaranteed in every case
to find a consistent lion (even if it exists) since the perceptual subcycle only
executes 4 perceptual actions and chooses the lion from the set of at most five unique
internal states. Further, perceptual actions are also occasionally ($1 - p' = 0.1$)
selected randomly. As a result, residual inconsistent lions occasionally arise and
interfere with the system’s performance.

Figures  6.3  and  6.4  show  that  the  system  learns  to  solve  the  task,  but  they 
say  nothing  about  which  instances  Meliora  learns  to  solve  first  or  the  order  in 
which the robot learns its task-dependent representation. To get a glimpse at
the order in which instances of the task are learned, each problem instance was
classified into one of four categories: easy, intermediate, difficult, and very difficult.
Easy  problems  correspond  to  instances  in  which  the  green  block  is  clear  and  the 
robot  need  only  pick  it  up.  Intermediate  problems  include  instances  where  the 
green  block  is  covered  by  one  block;  difficult  problems,  two  blocks;  and  very 
difficult  problems,  three  blocks.  Plots  of  the  average  trial  times  and  average 
success  rate  for  each  of  these  four  classes  of  problems  are  shown  in  Figure  6.5  and 
Figure  6.6,  respectively.  Both  figures  show  that  the  agent  first  learns  to  solve  easy 
tasks reliably, and then learns more and more difficult ones. In Figure 6.5, the
agent shows improvement on easy tasks immediately; it shows improvement on
intermediate tasks after 10-20 trials; on difficult tasks after 50-60 trials; and on
the most difficult tasks after 70-80 trials (see Figure 6.5b). A similar trend is seen
in Figure 6.6, which also shows that the agent eventually learns to solve all but
the most difficult tasks reliably and then only fails about 10% of the time.⁴

To  determine  the  order  in  which  Meliora-II  learns  a  consistent  representation, 
statistics  were  collected  to  measure  the  amount  of  overestimation  that  occurs  dur¬ 
ing  learning.  As  before,  world  states  were  classified  into  four  categories  according 
to  their  distance  to  the  goal:  easy,  intermediate,  difficult,  and  most  difficult.  For 
each  class,  the  fraction  of  times  per  trial  (over  200  runs)  the  lion  overestimated 
(and  was  suppressed)  was  maintained  as  a  function  of  the  number  of  trials  seen. 
These percentages are plotted in Figure 6.7. As expected, the agent initially
overestimates a high fraction of the time. This fraction is especially high because a
single overestimation can cause a chain of subsequent overestimations; and lacking
knowledge on how to control perception, the agent frequently fails to choose
a consistent lion. With experience, however, the agent eventually learns to select
consistent internal states, and the amount of overestimation decreases.

We  expected  Meliora-II  to  learn  consistent  lions  for  easy  states  first  and  then 
to  boot-strap  its  way  to  consistency  for  more  and  more  distal  states.  To  some 
extent this expectation is verified in Figure 6.7, which shows that the amount of
overestimation  decreases  first  for  easy  states  and  decreases  later  for  more  diffi¬ 
cult  states.  Early  on,  the  fractions  for  intermediate,  difficult,  and  most  difficult 
problems  are  virtually  indistinguishable,  explained  by  the  fact  that  initially  these 
more  difficult  problems  are  rarely  solved,  and  when  they  are,  they  tend  to  be  inef¬ 
ficient.  For  example,  when  solving  an  intermediate  problem,  it  is  common  for  the 
agent  to  stack  an  extra  block  on  the  green  pile,  try  other  unhelpful  actions  within 
that  configuration  for  a  while,  unstack  the  block,  and  go  on  to  solve  the  problem. 
Thus,  the  agent  sees  mixes  of  intermediate,  difficult,  and  most  difficult  states. 
Initially,  therefore,  all  trials  end  up  visiting  about  the  same  fraction  of  consistent 
states.  This  random  searching  is  much  less  prevalent  in  easy  tasks  whose  solu¬ 
tions  involve  only  one  or  two  correct  actions.  Eventually,  as  the  agent  learns  to 
solve  easy  problems  (after  80-100  trials),  intermediate  states  become  increasingly 
consistent  and  the  agent  visits  harder  states  less  frequently  on  its  way  to  the  goal. 
The  inconsistency  in  intermediate  states  tends  to  decrease  while  the  consistency 
of  more  difficult  states  remains  unchanged. 

Figure  6.7  also  shows  that  after  1000  trials  the  agent  continues  to  overestimate 
a substantial fraction of the time. This fraction is fairly low for easy problems (≈
5%) but higher for the most difficult problems (≈ 45%). There are three reasons
for  this  high  rate  of  overestimation.  First,  as  previously  mentioned,  Meliora-II  is 

⁴Increasing $n_{quit}$ slightly, say to 40, almost always gives the agent the extra time it needs to
solve even the most difficult problems.
Figure 6.5: Plots of the average number of steps per trial for each of the four classes
of problem instances: i) easy (no unstacking); ii) intermediate (1 to unstack); iii)
difficult (2 to unstack); and iv) most difficult (3 to unstack). a) shows a complete
plot ranging from 0 to 1000 trials; b) shows a focused plot ranging from 0 to 200
trials. The plots show that the agent learns to solve easier tasks first.




Figure 6.6: Success rates for each of the four classes of problem instances versus
the number of trials seen by the agent. a) shows a complete plot ranging from
0 to 1000 trials; b) shows a focused plot ranging from 0 to 200 trials. The plots
show that the agent learns to solve easier tasks first and eventually learns to solve
all instances fairly reliably.
Figure 6.7: The fraction of overestimations encountered over 200 runs for each of
the four classes of problem instances. The plot shows that consistent representations
are learned for easy problems first, followed by consistent representations
for more difficult problems, and that the agent continues to perform in the face
of residual inconsistencies and overestimation.

not  guaranteed  always  to  find  a  consistent  internal  state,  even  if  one  exists.  This 
explains  the  small  fraction  of  steady  state  overestimation  that  occurs  even  for 
easy  problems.  Second,  a  single  overestimation  (and  suppression)  tends  to  cause 
a  chain  reaction  of  overestimations  in  earlier  “set-up”  states  (even  for  consistent 
states).  Thus,  the  high  fraction  of  overestimation  in  more  distal  (difficult)  states 
is  explained  by  the  fact  that  occasional  overestimations  in  easy  states  propagate 
back  to  these  states  and  destroy  consistencies  there.  Third,  when  overestimations 
occur  they  tend  to  impair  the  agent’s  decision  policy  temporarily.  Often  the  agent 
will  waste  a  great  deal  of  time  in  an  inconsistent  confused  loop  until  it  gives  up 
or  manages  to  stumble  onto  a  state  from  which  it  can  solve  the  problem.  As 
a result, these statistics are misleading in that they tend to report overestimations
repeatedly for the same inconsistent internal states.

The robustness of Meliora’s performance in the face of persistent overestimations
led us to consider tasks with more than four blocks. Another set of
experiments was performed in which the problem instances ranged from easy (0
blocks to unstack) to most difficult (3 blocks to unstack). In these experiments,
however, additional outlying blocks were added to the pile. The number of outliers
was randomly chosen between 0 and 20.⁵ Outliers interfere with the system’s
ability to learn the most difficult instances because the agent’s sensory-motor system
cannot distinguish between stacks containing four or more blocks. Therefore,
the agent has no way of distinguishing (under any sensory-motor configuration)
states where it has to unstack three blocks from states where it has to unstack 4,
5, 6, or more blocks. These states do not have consistent internal representations.
Results from the experiments are shown in Figure 6.8. They are comparable to the
results from the earlier experiment, except with slightly longer average solution
times and a slightly lower success rate (especially for the most difficult instances).
Nevertheless, even in the face of inconsistencies the agent is capable of learning a
robust decision policy.

⁵Subsequent experiments with as many as 50 blocks have shown similar results.


6.3  The  Consistent  Representation  Method 

The lion algorithm used by Meliora-II is a specific instance of a general technique
which we call the Consistent Representation (CR) Method. The remainder of
this chapter outlines the CR-method, describes the lion algorithm in terms of it,
considers alternatives to the techniques used in Meliora-II, and describes other
recent work that makes use of the CR-method.

In the CR-method, control comprises two stages: an identification stage
and an overt control stage (cf. Figure 6.2 and Figure 6.9). The objective of the
identification stage is to generate a consistent, task-dependent internal representation.
This is accomplished by an identification procedure $i$ that executes a series
of non-invasive perceptual actions that collect the information needed to define a
consistent internal state. Once a consistent internal state has been identified, an
overt control procedure $b$ is invoked, which generates a single overt action. Both the
identification and overt control procedures are adaptive. The identification procedure
is adjusted to eliminate inconsistent states from the internal representation,
and the overt control procedure is adjusted to maximize future expected return.
Let $u_i$ and $u_b$ denote the procedures used to update the identification and overt
control procedures, respectively. A schematic diagram of the CR architecture is
shown in Figure 6.9.

Figure 6.8: Performance plots for experiments that include piles of up to 20
outlying blocks. a) shows the average solution time for each class of problem,
b) shows the success rate for each class of problem, and c) shows the fraction
of overestimations observed for each class of state. The plots are comparable
to those in our original experiments and show that the agent can learn even in
environments that it cannot consistently represent.

Figure 6.9: The basic architecture of a system using the CR-method. Control is
accomplished in two stages: an identification stage, followed by an overt control
stage. The goal of identification is to generate a consistent, task-dependent internal
state space. The goal of overt control is to maximize the future discounted
return. In the figure, $i$ and $b$ represent the identification and overt control procedures,
respectively. $i$ and $b$ are both adaptable, and the algorithms used to update
them are $u_i$ and $u_b$, respectively.

The operations in the decision cycle of a CR-method are as follows:

1. At time $t$, evaluate the identification procedure $i$, generating an internal
state $s_t$, which represents the current external state.

2. Using $s_t$ as input, evaluate the overt control procedure $b$, generating an overt
action $a_t$.

3. Perform $a_t$ and observe the next internal state $s_{t+1}$ and the reward $r_t$ obtained.

4. Evaluate $u_i$, updating $i$ based on the observations made in Step 3.

5. Evaluate $u_b$, updating $b$ based on the observations made in Step 3.
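
The five steps of the decision cycle can be sketched as a generic control loop. This is a schematic sketch only: the function name, the callback signatures, and the choice to reuse one `identify` call both as the observation in Step 3 and as Step 1 of the next cycle are assumptions made for illustration. The callbacks stand in for $i$, $b$, $u_i$, and $u_b$.

```python
def cr_decision_cycle(identify, control, perform, update_identify, update_control, steps):
    """Run the CR decision cycle for a fixed number of overt actions.

    identify()                  -> internal state for the current external state (i)
    control(s)                  -> overt action for internal state s (b)
    perform(a)                  -> reward obtained by executing overt action a
    update_identify(s, a, r, s2) adapts the identification procedure (u_i)
    update_control(s, a, r, s2)  adapts the overt control procedure (u_b)
    """
    trace = []
    s = identify()                          # Step 1
    for _ in range(steps):
        a = control(s)                      # Step 2
        r = perform(a)                      # Step 3: act and collect the reward ...
        s_next = identify()                 # ... and observe the next internal state
        update_identify(s, a, r, s_next)    # Step 4
        update_control(s, a, r, s_next)     # Step 5
        trace.append((s, a, r, s_next))
        s = s_next
    return trace


# Toy usage: a counter world in which reaching 3 yields a reward of 1.0.
world = {"x": 0}


def step_world(a):
    world["x"] += a
    return 1.0 if world["x"] == 3 else 0.0


trace = cr_decision_cycle(
    identify=lambda: world["x"],
    control=lambda s: 1,
    perform=step_world,
    update_identify=lambda s, a, r, s2: None,
    update_control=lambda s, a, r, s2: None,
    steps=3,
)
print(trace)
```

In Meliora-II terms, `identify` would wrap the perceptual cycle plus lion selection, and the two update callbacks would split the work of Update-Overt-Q-Estimates between non-lion and lion decisions.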

The  central  idea  behind  the  CR-method  is  that  the  agent  aims  to  learn  and 
base  its  actions  on  an  internal  representation  of  the  task  that  is  consistent.  Each 
internal  state  is  assumed  to  define  an  equivalence  class  of  external  states  that  all 
share  the  same  set  of  action-values.  In  other  words,  each  internal  state  satisfies 
the  Markov  property  with  respect  to  predicting  future  rewards  —  information 
in  addition  to  the  internal  state  does  not  improve  the  agent’s  ability  to  predict 
future  reward.  When  inconsistencies  exist,  they  are  detected  and  eliminated.  This 
consistency  assumption  is  derived  from  our  desire  to  use  existing  reinforcement 
learning  techniques  for  overt  control.  That  is,  if  overt  control  is  to  be  learned  using 
Q-learning  [Watkins,  1989],  the  AHC  algorithm  [Sutton,  1984],  the  bucket  brigade 
[Holland et al., 1986], or other learning algorithms based on temporal differences,
then  the  internal  representation  generated  by  the  identification  stage  must  be 
consistent.  Our  desire  to  use  reinforcement  learning  for  overt  control  constrains 
the  form  of  the  internal  decision  problem  and  defines  the  requirements  for  the 
identification  stage.  It  also  imposes  requirements  on  the  sensory-motor  system. 
In  particular,  the  sensory  system  must  be  capable  of  providing  at  each  point 
in  time  enough  information  to  identify  a  consistent  internal  state.  If  sufficient 
information  cannot  be  attained  from  immediate  sensor  data,  then  the  sensory 
system  may  need  to  be  augmented  with  some  kind  of  short  term  memory  that  can 
be  used  to  keep  track  of  relevant  information  from  the  past. 

One  might  say  that  with  the  CR-method  we  are  aiming  to  construct  internal 
decision  problems  that  are  Markov.  While  it  is  the  case  that  a  Markov  internal 
decision  problem  has  a  state  space  that  is  consistent,  it  need  not  be  the  case  that 
a  consistent  internal  state  space  defines  a  Markov  decision  problem.  In  particular, 
it  is  possible  for  an  internal  decision  problem  to  have  a  consistent  internal  state 
space  but  have  a  non-Markov  transition  function.  Thus,  the  class  of  decision 
problems  with  consistent  state  spaces  is  a  strict  superset  of  the  class  of  Markov 
decision  problems.  Nevertheless,  it  is  intuitively  correct  to  think  of  an  agent  as 
learning  a  Markov  representation  of  the  external  task. 


6.3.1  Meliora-II  as  a  CR  system 

Meliora-II, and the lion algorithm it employs, is one instantiation of the
CR-method. However, there are many other ways to implement the various components
of the architecture and these are worth exploring. In this subsection, we
discuss Meliora-II from the general perspective of the CR-model and discuss a
number of alternatives. In the next subsection, we briefly describe two other systems
[Chapman and Kaelbling, 1991; Tan, 1991a; Tan, 1991b] that also exemplify
the CR-method.
Overt Control in Meliora-II


From the description in Figure 6.1, it may not be completely apparent that
Meliora-II uses 1-step Q-learning for adaptive overt control. However, notice that
at each time step a single internal state is identified (the lion) and used to guide
decision making and utility estimation. When the lion states do not overestimate
(i.e., are not suspected of being inconsistent), their action-values are updated
using the 1-step Q-learning rule. Also, notice that the utility estimates used in the
updating rule are just the utility estimates of the lions (the identified states).
Updating the action-values for non-lions also uses a modified 1-step estimator, but
is better thought of as part of the identification process. Thus in Meliora-II, Ub,
the updating algorithm for overt control, corresponds to the updating procedure
used in 1-step Q-learning.

Similarly,  the  overt  control  procedure,  5,  used  in  Meliora-II  corresponds  to  a 
simple  version  of  1-step  Q-learning.  Namely,  with  fixed  probability  p  the  agent 
chooses  an  action  at  random;  otherwise,  it  performs  the  action  with  the  largest 
action-value. 
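
As an illustration only, the overt control scheme just described can be sketched in
Python. The function names, the dictionary used to store action-values, and the
default value of 0.0 for unseen state-action pairs are assumptions made for this
example; they are not taken from Meliora-II.

```python
import random


def choose_action(q, state, actions, p, rng):
    """With probability p pick a random action; otherwise pick the greedy one."""
    if rng.random() < p:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """1-step Q-learning: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    v_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    error = reward + gamma * v_next - old
    q[(state, action)] = old + alpha * error
    return error
```

The update returns the 1-step estimation error, since the sign of that error is
what the lion algorithm monitors when it looks for overestimating states.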

Alternative  Approaches  to  Overt  Control 

In  general,  any  number  of  reinforcement  learning  algorithms  can  be  used  to  im¬ 
plement  overt  control.  We  used  1-step  Q-learning  because  it  is  simple  and  conve¬ 
nient,  but  other  techniques,  such  as  multi-step  Q-learning  [Watkins,  1989],  AHC 
algorithms [Sutton, 1984], bucket brigade algorithms [Holland et al., 1986], and
Interval  Estimation  algorithms  [Kaelbling,  1990],  can  be  used  as  well. 

Identification  in  Meliora-II 

In  Meliora-II,  the  identification  procedure  proceeds  by  executing  a  series  of  per¬ 
ceptual  actions  which  collect  a  set  of  candidate  internal  representations.  The 
perceptual  actions  executed  are  determined  by  a  perceptual  control  policy  which 
maps internal states into perceptual actions. This policy is learned using elements
of 1-step Q-learning. However, instead of estimating future returns, the
action-values encode information about the likelihood of generating consistent internal
states. Perceptual actions that cause the system to focus on relevant aspects of
the environment and lead to consistent internal states tend to have larger
action-values than perceptual actions that focus attention on irrelevant aspects of the
environment. Once a number of candidate internal states are collected, the state
with the largest action-value is selected to represent the current external state.

A  crucial  operation  in  any  algorithm  based  on  the  CR-method  is  the  identifica¬ 
tion  of  inconsistent  internal  states.  In  Meliora-II,  inconsistent  internal  states  are 
detected  by  monitoring  the  sign  in  the  estimation  error  of  the  1-step  Q-learning 
Figure 6.11: Inconsistent decisions can be detected by partitioning the equivalence
class of a decision in two and looking for differences in the return distributions for
the two subsets. If a partitioning exists for which the expected returns for the two
subsets differ, then the decision is necessarily inconsistent. If no such partitioning
exists, then the decision is consistent.


the  way  until  a  leaf  is  encountered.  If  every  leaf  node  in  the  tree  is  consistent, 
then  the  tree  represents  a  universal  identification  procedure.  If  not,  the  tree  can 
be  refined  by  detecting  the  inconsistent  leaves  and  refining  them  by  splitting  them 
with  additional  sensing  operations. 

In Meliora-II, inconsistent states are detected based on overestimation of the
action-value function. While this technique is sufficient for the GB-task, it is
extremely limited since it works only for decision problems that are deterministic
and that have non-negative rewards. Fortunately, there are other techniques that
are more general and probably more effective in the long run.

By definition, an internal decision is inconsistent if any of the decisions it
represents in the external model have different optimal action-values. Thus, one way
to determine whether or not a decision d' is inconsistent is to split its equivalence
class, DRep(d'), into two or more subsets and look for differences in the expected
returns of each subset. If a partitioning exists such that the expected returns
for two subsets differ, then the decision is necessarily inconsistent. If no such
partitioning exists, then the decision is consistent. This idea is illustrated
graphically in Figure 6.11. The basic idea is to keep track of some extra information
(e.g., previous state information or additional sensory input bits) that can be used
to partition DRep(d') and then use statistical techniques to determine if the
expected returns for these subsets are identical. There are several statistical methods
that can be used to determine whether or not the expected returns are identical.
The Student's T-test [Lehmann, 1959] used by Chapman and Kaelbling [Chapman
and Kaelbling, 1991] is perhaps the most appropriate since in many cases
the return distribution is likely to be approximately normal. However, other
parametric and nonparametric tests [Bradley, 1968; Randles and Wolfe, 1979] may also
prove useful, including techniques based on uncertainty intervals [Kyburg, 1991;
Kaelbling, 1990].
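
To make the partition-and-compare idea concrete, the following Python sketch splits
the returns observed for one internal decision according to a single extra bit and
compares the two subsets with a pooled-variance two-sample Student's t statistic.
This is a sketch under stated assumptions, not the procedure used by any of the
systems discussed: the function names, the sample format, and the caller-supplied
critical value `t_critical` are all hypothetical.

```python
import math
from statistics import mean, variance


def student_t(xs, ys):
    """Pooled-variance two-sample Student's t statistic (needs >= 2 samples each)."""
    nx, ny = len(xs), len(ys)
    pooled = ((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2)
    diff = mean(xs) - mean(ys)
    if pooled == 0.0:
        # Degenerate case: neither subset shows any spread.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled * (1.0 / nx + 1.0 / ny))


def looks_inconsistent(samples, t_critical):
    """samples: (extra_bit, observed_return) pairs for one internal decision.

    Returns True when the returns for extra_bit == 0 and extra_bit == 1
    differ by more than the caller-supplied critical value (an assumption).
    """
    r0 = [r for bit, r in samples if bit == 0]
    r1 = [r for bit, r in samples if bit == 1]
    if len(r0) < 2 or len(r1) < 2:
        return False  # too little data to compare the two subsets
    return abs(student_t(r0, r1)) > t_critical
```

The critical value would normally be read from a t table for the chosen
significance level and nx + ny - 2 degrees of freedom.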

The  appeal  of  statistical  methods  is  that  they  apply  in  general.  However, 
one  troublesome  assumption  made  by  statistical  methods  is  that  the  underlying 
stochastic process is stationary. Unfortunately, in reinforcement learning the
return distribution is almost never stationary since the agent is continually updating
its  utility  estimates  and  its  control  policy.  Thus,  detecting  statistical  differences 
in  return  distributions  may  be  troublesome.® 

In general, for any random variable X, we will say that the internal decision
d' = (s', a') has uniform statistics (or is consistent) with respect to X if any and
all subsets of DRep(d') have the same underlying distribution with respect to
X. Given this definition, a related statistical approach that may be useful for
circumventing the non-stationarity associated with return distributions is to detect
inconsistent states by enforcing local consistency requirements. In particular, if
each internal decision is consistent with respect to the next internal state and the
next immediate reward, then every decision is necessarily consistent with respect
to the return. Since these local statistics (next state and immediate reward)
are stationary, they are likely to be easier to estimate and compare accurately.
This local approach is equivalent to monitoring the internal decision process to
ensure that it is Markov. When a statistical test shows a state to be non-Markov
with respect to the immediate reward or the next-state transition, the state is
inconsistent and the identification function must be refined.

A  third  approach  to  detecting  inconsistent  states  that  avoids  statistical  tech¬ 
niques  altogether  is  to  rely  on  feedback  from  external  sources.  For  instance,  if 
the  learning  agent  is  embedded  in  a  cooperative  social  environment  where  it  can 
receive  immediate  feedback  from  an  external  supervisor  or  can  watch  the  behavior 
of  another  skilled  agent,  then  it  may  be  possible  for  the  agent  to  deduce  which 
states  are  inconsistent  by  detecting  variation  in  the  feedback  or  behavior  of  the 
external  agent.  For  instance,  if  an  external  critic  provides  a  signal  indicating  the 
correctness  of  an  action  immediately  and  if  the  signal  varies  for  a  given  internal 
decision,  then  the  decision  must  be  inconsistent  (assuming  the  external  critic  is 
reliable). Similarly, if for a given internal state the agent observes a variation in
the actions performed by a skilled role model, then the state is probably
inconsistent (again assuming the role model's behavior is reliably optimal). These more
direct methods for detecting inconsistent states make strong assumptions about
the environment in which the agent is embedded, but can substantially reduce
the time needed to learn consistent internal representations. We have successfully
applied these direct methods in the versions of Meliora described in Chapter 7.

®Chapman and Kaelbling [Chapman and Kaelbling, 1991] have successfully used the Student's
T-test on return distributions to detect inconsistencies for a relatively complex task. Judging
from their success, non-stationarity may be less of a problem than expected. Certainly more
empirical evidence is needed to know for sure.


6.3.2  Other  CR-systems 

Two  other  adaptive  control  algorithms  that  are  excellent  examples  of  the  CR- 
method  are  the  G-algorithm  [Chapman  and  Kaelbling,  1991]  and  the  CS-QL  al¬ 
gorithm  [Tan,  1991a;  Tan,  1991b]. 


The  G  Algorithm 

Recent  work  by  Chapman  and  Kaelbling  aims  to  address  the  problem  of  having 
to  generalize  over  state  spaces  generated  by  sensory  inputs  with  excessive  irrele¬ 
vant  detail.  In  their  target  domain  the  sensory  system  generates  more  than  one 
hundred  bits  of  input,  most  of  it  irrelevant.  While  the  large  input  vector  defines 
a consistent internal representation, the overwhelming size of the internal state
space that results (on the order of 2^100 states) severely interferes with Q-learning
by making too many distinctions. They call the problem of filtering out the irrelevant
information the input generalization problem. Their approach, called the G-algorithm,
is  to  incrementally  build  a  decision  (or  classification)  tree,  called  a  G-tree.  The 
G-tree  partitions  the  set  of  possible  inputs  into  a  much  smaller  set  of  internal 
states.  The  leaves  of  the  tree  define  the  internal  states  and  internal  nodes  identify 
relevant  input  bits  (or  tests)  to  perform  during  identification.  The  tree  represents 
a  universal  identification  procedure  in  which,  at  each  time  step,  the  input  vector 
is  classified  by  traversing  the  tree  from  the  root  to  a  leaf  by  following  the  branches 
whose  values  match  the  bits  in  the  input  vector.  Input  bits  that  are  not  explicitly 
tested  in  the  G-tree  are  considered  irrelevant  and  are  ignored. 
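
The traversal described above can be sketched as follows. This is only an
illustrative Python rendering of a binary decision tree over input bits; the `Node`
class, its field names, and the `split_leaf` helper are assumptions introduced for
the example and are not Chapman and Kaelbling's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    bit: Optional[int] = None      # index of the input bit tested; None for a leaf
    zero: Optional["Node"] = None  # subtree followed when the tested bit is 0
    one: Optional["Node"] = None   # subtree followed when the tested bit is 1
    state: Optional[int] = None    # internal state id (leaves only)


def classify(root: Node, bits: Sequence[int]) -> Optional[int]:
    """Walk from the root to a leaf, following the branch matching each tested bit."""
    node = root
    while node.bit is not None:
        node = node.one if bits[node.bit] else node.zero
    return node.state


def split_leaf(leaf: Node, bit: int, new_state: int) -> None:
    """Refine a leaf into a test on `bit`, yielding two leaves (two internal states)."""
    leaf.zero = Node(state=leaf.state)
    leaf.one = Node(state=new_state)
    leaf.bit, leaf.state = bit, None
```

Bits that never appear as a `bit` index in the tree are never read, which is the
sense in which untested inputs are ignored.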

The  objective  of  the  G-tree  is  to  define  an  internal  state  space  that  is  consistent. 
Internal  states  that  are  not  consistent  are  detected  and  split  into  two  new  states 
by  adding  relevant  bit  tests  to  the  leaves  in  the  tree.  To  detect  inconsistent  states 
the  G-algorithm  uses  the  Student’s  T-test  on  two  statistics:  the  immediate  reward 
and  the  discounted  future  return.  If  dividing  a  leaf  node  in  two  (by  adding  another 
input  bit  to  the  tree)  leads  to  subnodes  that  are  statistically  different,  then  the 
leaf  is  suspected  of  being  inconsistent  and  split  along  that  bit. 

The  G-algorithm  is  used  to  learn  an  identification  procedure  (i.e.,  the  G- 
tree),  while  another  component  learns  an  overt  control  policy.  For  overt  control, 
Chapman and Kaelbling use a variation on Q-learning called the Interval
Estimation (IE) algorithm [Kaelbling, 1990]. Using the IE algorithm along with the
G-algorithm, they have demonstrated a system that learns an efficient G-tree and
learns to solve a difficult sequential decision problem.

The  CS-QL  Algorithm 

A  similar  system  was  independently  developed  by  Tan  [Tan,  1991a;  Tan,  1991b]. 
Like  Chapman  and  Kaelbling’s,  Tan’s  system  uses  an  incrementally  built  decision 
tree  to  classify  situations  into  internal  states.  However,  Tan’s  algorithm,  called 
CS-QL,  is  different  in  several  ways.  First,  instead  of  splitting  nodes  by  bits  in  the 
sensory  input,  nodes  are  split  on  more  general  sensing  operations  that  have  costs 
and  may  be  multi-valued.  Second,  insertion  of  sensing  operations  into  the  decision 
tree takes into account both the utility and the cost of the operation, thus yielding
a cost-sensitive decision tree. This is an advancement over the G-algorithm, which
does not account for the cost of perceptual actions. In the lion algorithm,
cost-sensitive perception is achieved by discounting the rewards used to update the
perceptual action-value function. The CS-QL algorithm uses an overestimation
technique similar to the one used in Meliora-II to detect inconsistent states.

For overt control, the CS-QL algorithm uses 1-step Q-learning. Since inconsistent
states are detected using overestimation, CS-QL is restricted to deterministic
tasks with non-negative rewards. Nevertheless, Tan has demonstrated the
algorithm  in  a  system  that  efficiently  learns  to  identify  landmarks  and  navigate 
in  a  simulated  2-D  environment. 




7  Cooperative  Mechanisms 


7.1  Introduction 

Reinforcement  learning  involves  a  process  that  searches  the  world  for  states  that 
yield reward. But for most real-world tasks, the state space is large and rewards
are  sparse.  Under  these  circumstances  the  time  required  to  learn  an  adequate  con¬ 
trol  policy  can  be  excessive.  The  detrimental  effects  of  search  manifest  themselves 
most  at  the  beginning  of  the  task,  when  lack  of  knowledge  can  lead  to  unbiased 
random  search,  and  in  the  middle  of  a  task,  when  changes  in  the  environment 
invalidate  an  existing  control  policy. 

In  nature,  intelligent  agents  do  not  exist  in  isolation  but  are  embedded  in  a 
benevolent  society  that  guides  and  structures  learning.  Humans  learn  in  rich, 
carefully  structured  environments;  they  learn  by  watching  others,  by  being  told, 
and  by  receiving  criticism  and  encouragement.  Learning  is  more  often  a  transfer 
than  a  discovery.  Similarly,  robots  cannot  be  expected  to  learn  very  much  in 
isolation.  They  must  be  embedded  in  cooperative  environments,  and  algorithms 
must  be  developed  to  facilitate  the  transfer  of  knowledge  among  them.  Within 
this  context,  the  reinforcement  learning  framework  continues  to  play  a  vital  role: 

1.  for  pure  discovery  purposes  —  that  is,  reinforcement  learning  can  be  useful 
for  increasing  the  collective  knowledge  of  a  society  as  a  whole, 

2.  for  refining  and  elaborating  knowledge  gained  from  others, 

3.  for  carrying  on  in  the  absence  of  guidance  from  others  and  for  interpolating 
between  periods  of  interaction,  and 

4.  for  providing  a  simple  signaling  mechanism  (rewards)  for  communicating 
and  transferring  knowledge. 

This chapter proposes two cooperative mechanisms to reduce search and
decouple the learning rate from state-space size. The first approach, called Learning
with an External Critic (LEC), is based on the idea of a mentor who watches the
learner and generates rewards, which signal the correctness of the agent's most
recent action. This reward is used temporarily to bias the learner's overt control
strategy and to detect inconsistent internal states more quickly. The second
approach, called Learning By Watching (LBW), is based on the idea that an agent
can gain valuable experience vicariously by relating the observed behavior of
others to its own. While LEC techniques require interaction with an external agent
that is knowledgeable and attentive, LBW techniques can be effective even when
the external agent is unskilled and unaware of the learner.

The chapter begins by developing and demonstrating LEC and LBW
algorithms in the context of the GB-task. Two new programs, Meliora-III-LEC and
Meliora-III-LBW, are described and shown to outperform Meliora-II significantly.
A formal analysis of search is then presented. To facilitate the analysis, attention
is focused on a restricted (but representative) class of decision problems, called
homogeneous problem solving tasks. For these tasks, reinforcement learning
algorithms that rely on random walks to solve problems initially are shown to have
expected learning times that are at least exponential in the depth of the state
space.* For Q-learning, true random walks can be avoided by proper selection
of an initial (unbiased) action-value function. In this case, the expected
learning time appears to be at least polynomial in the state space depth. Further
improvement can be made by using the LEC and LBW algorithms, which have
expected learning times that are at most linear in the size of the state space and,
under appropriate conditions, are independent of the state space size altogether
and proportional to the length of the optimal solution path.


7.2  Learning  with  an  External  Critic 


Learning with an external critic (LEC) enhances the learning environment by
providing helpful hints that indicate the appropriateness of the robot's most recent
actions. LEC algorithms achieve faster learning by eliminating the delay between
the performance of an action and its evaluation (feedback), thus facilitating credit
assignment. Depending upon the communication skills of the learner and the
critic, a range of LEC algorithms can be devised [Whitehead and Ballard, 1991b].
A particularly simple approach, called Binary LEC (B-LEC), is to assume that
after each overt action an external critic generates a binary signal, $sig(t)$, with
probability $p_{critic}$ according to the rule:

$$
sig(t) =
\begin{cases}
\text{YES} & \text{if } a_t \text{ is optimal} \\
\text{NO} & \text{otherwise}
\end{cases}
\qquad (7.1)
$$

where $a_t$ is the overt action performed by the learner at time $t$.

*The depth of a state space is formally defined in Section 7.4; however, intuitively it
corresponds to the maximum distance (measured in number of steps) between any two states in a
state space.

To  evaluate  the  potential  utility  of  this  immediate  feedback  a  new  program, 
Meliora-III-LEC,  was  developed  to  exploit  this  information  for  the  GB-task.  The 
control algorithm used by Meliora-III-LEC is called the Biasing Binary LEC
(BB-LEC) algorithm.
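
For concreteness, the B-LEC signalling rule of Equation 7.1 might be simulated as in
the short Python sketch below. The function name, the use of `None` to stand for a
silent critic, and the representation of the optimal actions as a set are
assumptions made for this example.

```python
import random


def critic_signal(action, optimal_actions, p_critic, rng):
    """Return "YES" or "NO" with probability p_critic; None when the critic is silent."""
    if rng.random() >= p_critic:
        return None  # no feedback on this step
    return "YES" if action in optimal_actions else "NO"
```

Setting `p_critic` to 0.0 gives an unsupervised learner, and 1.0 gives feedback
after every overt action.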

7.2.1  Meliora-III-LEC 

In  Meliora-III-LEC,  immediate  feedback  from  the  critic  is  used  to  facilitate  adap¬ 
tation  in  both  overt  control  and  state  identification.  With  respect  to  overt  control, 
the  critic’s  signal  is  used  to  bias  the  robot’s  overt  control  policy.  This  is  espe¬ 
cially  useful  during  the  early  stages  of  learning  since  it  reduces  floundering.  With 
respect  to  state  identification,  variation  in  the  critic’s  feedback  is  used  to  detect 
inconsistent  internal  states  quickly. 

The  decision  algorithm  used  in  Meliora-III-LEC  is  shown  in  Figure  7.1.  Struc¬ 
turally,  it  is  quite  similar  to  the  lion  algorithm  used  by  Meliora-II.  However,  a 
number  of  minor  modifications  have  been  made  to  take  advantage  of  the  informa¬ 
tion  encoded  in  the  external  critic’s  signal. 

Biasing  Overt  Control 

In  order  to  positively  bias  the  robot’s  overt  control  policy,  the  critic’s  signal  is 
converted into an internal source of reward. At time $t$, the robot generates an
internal reward, $r_c(t)$, according to the rule:

$$
r_c(t) =
\begin{cases}
+R_c & \text{if } sig(t) = \text{YES} \\
-R_c & \text{if } sig(t) = \text{NO} \\
0 & \text{otherwise}
\end{cases}
\qquad (7.2)
$$

where  Rc  is  a  positive  constant.  According  to  this  rule,  positive  and  negative 
internal  rewards  are  generated  whenever  positive  or  negative  feedback  is  returned 
by  the  critic,  respectively.  When  no  signal  is  returned  by  the  critic  the  internal 
reward  is  zero. 

The reward received from the environment, denoted r,„, and the critic's reward
are treated separately. Reward from the environment is treated as in Meliora-II:
it is used to learn a specialized overt action-value function. Reward from the
critic is used to learn a bias function $B$ over internal state-action pairs. The
bias function estimates the expected value of the immediate reward received from
the critic for each state-action pair. In Meliora-III-LEC, the values of the bias
function are updated using a simple temporally weighted average:

$$
B_{t+1}(s, a_t) \leftarrow (1 - \psi) B_t(s, a_t) + \psi\, r_c(t) \quad \text{for each } s \in S_t, \qquad (7.3)
$$
Overt Cycle:

1) Execute Perceptual Cycle and generate a set of internal
   representations, $S_t$, for the current world state.

2) Let $d_l = (s_l, a_l)$ be the maximal internal decision:
   $d_l = \arg\max_{(s,a) \in S_t \times A_O} (Q_I(s,a) + B(s,a))$

3) Choose a state, the lion, to represent the current world state: $lion_t = s_l$.

4) Estimate the utility of the current world state, $s_t$: $V_E(s_t) \leftarrow V_I(lion_t)$.

5) Execute Update-Overt-Estimates based on $r_{t-1}$, $oact_{t-1}$, $lion_{t-1}$, $sig(t-1)$;
   where $r_t$ is the reward received at time $t$, $oact_{t-1}$ is the last overt action executed,
   $lion_{t-1}$ is the internal state selected to represent the previous world state,
   and $sig(t-1)$ is the critic's signal for time $t-1$.

6) Choose the next overt action to execute:
   With probability $p$ follow policy $f_o(lion_t) = a_l$,
   Otherwise choose randomly: $oact_t = Random(A_O)$

7) Execute $oact_t$ to obtain $r_t$, $sig(t)$, and $s_{t+1}$.

8) Go to 1).

Update-Overt-Estimates:

1) Estimate the error in the lion's action-value:
   $E_{lion} \leftarrow (r_{t-1} + \gamma V_E(s_t)) - Q_I(lion_{t-1}, oact_{t-1})$.

2) Update the action-value of the lion:
   If ($E_{lion} < 0$) or
   ($B(lion_{t-1}, a_{t-1}) > 0$ and $sig(t-1)$ = NO) or ($B(lion_{t-1}, a_{t-1}) < 0$ and $sig(t-1)$ = YES)
   then the lion is suspected of being inconsistent, so suppress it: $Q_I(lion_{t-1}, oact_{t-1}) \leftarrow 0.0$
   Else update it using the standard 1-step Q-learning rule:
   $Q_I(lion_{t-1}, oact_{t-1}) \leftarrow Q_I(lion_{t-1}, oact_{t-1}) + \alpha E_{lion}$.

3) Update non-lion internal states:
   For each $s \in S_{t-1}$ and $s \neq lion_{t-1}$ do:
   Let $E_s = r_{t-1} + \gamma V_E(s_t) - Q_I(s, oact_{t-1})$
   If ($E_s < 0$) or
   ($B(s, a_{t-1}) > 0$ and $sig(t-1)$ = NO) or ($B(s, a_{t-1}) < 0$ and $sig(t-1)$ = YES)
   then $s$ is suspected of being inconsistent, so suppress it: $Q_I(s, oact_{t-1}) \leftarrow 0.0$
   Else update it using the lion's error:
   $Q_I(s, oact_{t-1}) \leftarrow Q_I(s, oact_{t-1}) + \alpha' E_{lion}$, where $\alpha' < \alpha$.

4) Update the bias function $B$:
   For each $s \in S_{t-1}$ do: $B(s, a_{t-1}) \leftarrow (1 - \psi) B(s, a_{t-1}) + \psi\, r_c(t-1)$.


Figure 7.1: An outline of the decision procedure implemented by Meliora-III-LEC.
This procedure is similar to the lion algorithm, except that reward from the critic
is 1) used to learn a bias function B, which is used to bias the agent's overt control
policy, and 2) used to detect inconsistent internal states. The Perceptual Cycle
is not shown here since it is identical to the one described in Figure 6.1.
where $S_t$ is the set of candidate internal states returned by the perceptual cycle
and $a_t$ is the internal overt action executed at time $t$. The learning rate parameter,
$\psi$, is a constant between 0 and 1. The average is temporally weighted so that the
robot can "forget" old advice that has not been recently repeated. Without this
weighting, the robot has difficulty adapting to changes in the task once advice is
extinguished or changed.
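
The following Python sketch illustrates Equations 7.2 and 7.3 together and shows
the forgetting behavior: once the critic falls silent, the bias decays geometrically
toward zero. The function names, the dictionary keyed by (state, action), and the
use of `None` for "no signal" are assumptions made for the example.

```python
def internal_reward(sig, r_c):
    """Equation 7.2: +R_c for YES, -R_c for NO, and 0 when no signal arrives."""
    if sig == "YES":
        return r_c
    if sig == "NO":
        return -r_c
    return 0.0


def update_bias(bias, states, action, sig, r_c, psi):
    """Equation 7.3: temporally weighted average of the critic's reward,
    applied to every candidate internal state paired with the executed action."""
    r = internal_reward(sig, r_c)
    for s in states:
        old = bias.get((s, action), 0.0)
        bias[(s, action)] = (1.0 - psi) * old + psi * r
```

With psi = 0.6 and R_c = 500, a single YES moves a zero bias to about 300, and each
subsequent silent step multiplies the bias by 0.4.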

The decision rule for overt control is simple. At time $t$, let $d_l = (s_l, a_l)$ be the
candidate internal decision that maximizes the sum of the action-value and the
bias function. That is,

$$
d_l = \arg\max_{(s,a) \in S_t \times A_O} [Q_I(s,a) + B(s,a)]. \qquad (7.4)
$$

In Meliora-III, the state identified to represent the current external world, the lion,
is simply the state, $s_l$, of the maximal decision, $d_l$. Similarly, the robot's overt
policy, $f_o(S_t)$, is the action $a_l$ of the maximal decision. Using this rule, decisions
that have previously been associated with positive feedback from the critic are
preferred over decisions that have received no feedback, and decisions associated
with negative feedback tend to be suppressed.
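
A minimal Python sketch of the decision rule in Equation 7.4 is given below. The
function name `select_lion` and the dictionary representation of the action-value
and bias functions (with missing entries read as 0.0) are assumptions made for
illustration.

```python
def select_lion(candidates, actions, q, bias):
    """Return the (state, action) pair maximizing Q_I(s, a) + B(s, a)
    over the candidate internal states and the overt actions."""
    return max(
        ((s, a) for s in candidates for a in actions),
        key=lambda d: q.get(d, 0.0) + bias.get(d, 0.0),
    )
```

The first element of the returned pair plays the role of the lion and the second
is the action proposed by the overt policy.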

After executing an action and observing the reward, next state, and critic's
signal that result, the action-value function, $Q_I$, is updated more or less the same
as in Meliora-II. If a decision is suspected of being inconsistent, it is suppressed.
Otherwise, if it is the lion's decision, it is updated using the 1-step Q-learning
rule. If not, it is a non-lion and is updated based on the lion's error. The bias
function is also updated at this time.

The  discounted  return  and  the  critic’s  immediate  reward  are  estimated  sep¬ 
arately  to  ensure  that  subsequent  extinction  of  the  critic’s  feedback  does  not 
inadvertently  disrupt  learning  of  the  underlying  decision  problem.  If  the  two 
rewards  are  combined  and  used  to  estimate  a  single  action-value  function,  then 
subsequent  extinction  of  the  critic’s  feedback  leads  to  a  reduction  in  the  over¬ 
all  expected  return  and  errors  in  the  action-value  function.  These  errors,  in  turn, 
cause  a  prolonged  period  of  non-optimal  behavior,  while  the  agent  estimates  a  new 
action-value function for the original, underlying decision problem (see
[Whitehead and Ballard, 1991b] for details). By separating the two rewards, the robot is
able to learn the action-value function for the underlying decision problem directly.
In  this  case,  the  only  effect  of  feedback  extinction  is  on  the  bias  function,  whose 
values  gradually  decrease  to  zero.  This  allows  the  external  critic  to  terminate 
“programming”  once  the  agent  has  learned  the  task. 


Detecting  Inconsistent  Decisions 

In  Meliora-III-LEC,  inconsistent  decisions  are  detected  both  by  using  the  overes¬ 
timation  technique  described  previously  and  by  using  feedback  from  the  external 
critic. This combined approach leads to improved performance when the critic is
available  but  also  allows  the  robot  to  fall  back  on  an  effective,  but  slower,  method 
when  immediate  feedback  is  absent. 

To  detect  inconsistent  decisions  using  the  critic,  the  agent  simply  looks  for 
internal  state-action  pairs  that  have  resulted  in  contradicting  signals.  That  is,  if 
executing the decision d = (s, a) yields positive feedback at one point in time and
negative  feedback  at  another,  then  (assuming  the  critic  is  reliable  and  the  task  is 
stationary)  it  must  be  inconsistent. 

Instead of keeping track of all the signals ever received for each internal
decision, the biasing function is used to detect contradicting signals. In particular,
at time $t$, for each $s \in S_{t-1}$, the decision $d = (s, a_{t-1})$ is considered
inconsistent if either $B(s, a_{t-1}) > 0$ and $sig(t-1) = \text{NO}$, or
$B(s, a_{t-1}) < 0$ and $sig(t-1) = \text{YES}$.

In  the  first  case,  the  positive  value  of  the  bias  function  indicates  that  in  the 
past  positive  signals  have  been  received  for  this  decision,  which  contradicts  the 
current  signal.  In  the  second  case,  the  negative  bias  value  indicates  previous 
negative  signals,  contradicting  the  present  positive  signal.  Both  cases  indicate  a 
contradiction. 
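
The contradiction test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Meliora-III-LEC implementation: the dictionary-based bias table, the function name `inconsistent_decisions`, and the string encoding of the critic's signal ("YES"/"NO") are all hypothetical choices made for the example.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Decision = Tuple[Hashable, Hashable]  # (internal state, overt action)


def inconsistent_decisions(
    bias: Dict[Decision, float],
    candidates: Iterable[Hashable],
    last_action: Hashable,
    signal: str,
) -> List[Decision]:
    """Return the decisions (s, last_action) whose bias sign contradicts `signal`.

    `candidates` plays the role of S_{t-1}, `last_action` of a_{t-1}, and
    `signal` of sig(t-1). Unseen decisions are assumed to have zero bias.
    """
    flagged = []
    for s in candidates:
        b = bias.get((s, last_action), 0.0)
        # Positive bias records past "YES" signals; negative bias past "NO".
        if (b > 0 and signal == "NO") or (b < 0 and signal == "YES"):
            flagged.append((s, last_action))
    return flagged


# Hypothetical bias table for three candidate internal states.
example_bias = {("s1", "grasp"): 2.5, ("s2", "grasp"): -1.0}
print(inconsistent_decisions(example_bias, ["s1", "s2", "s3"], "grasp", "NO"))
```

A decision with zero bias (here `s3`) is never flagged, since no earlier signal exists for it to contradict.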

7.2.2 Experimental Results

Meliora-III-LEC was applied to the GB-task to test the BB-LEC algorithm and
to determine the effect of an external critic on the learning rate. Each experiment
consisted of 100 runs of 250 trials each. To demonstrate the degree of thrashing
that occurs in an unsupervised system, the time limit was increased to 100 steps.
Plots of the average solution time versus trial number are shown in Figure 7.2.
The results shown in this figure are for $\gamma = 0.8$, $\psi = 0.6$,
$\alpha = 0.2$, $p = 0.9$, $p' = 0.9$, and $R_c = 500$. Qualitatively similar
results were obtained for a wide range of parameter values. Each curve in the
figure corresponds to a different rate of feedback from the external critic,
ranging from no feedback ($p_{critic} = 0.0$, Meliora-II) to feedback after each
overt action ($p_{critic} = 1.0$). The figure clearly demonstrates the performance
improvement gained by addition of an external critic. Even occasional feedback
(e.g., $p_{critic} = 0.2$) has a tremendous effect.

To assess the effect of the critic’s signal on state identification, the average
number of suppressed decisions per trial is plotted versus trial number for the
GB-task in Figure 7.3. During the first few trials, the systems that receive
immediate feedback begin to learn the bias function and have high suppression
rates. Some of these suppressions are due to contradictions in the critic’s signal;
however, most are due to action-value overestimations since in these experiments
all action-values were initialized to 1.0. After a dozen or so trials the systems
with feedback begin to perform nearly optimally and the suppression rate drops
dramatically. The low steady state suppression rate can be explained by noting
that inconsistent decisions tend to have negative (or lower) bias values. This
tends to keep them from competing for lionhood when consistent states are
spuriously suppressed in the steady state.

Figure 7.2: Plots of the average solution time versus trial number for
Meliora-III-LEC for $p_{critic}$ ranging from 0.0 to 1.0. The plots show that
feedback from an external critic is extremely useful for improving the learning
rate, even when it occurs only occasionally.

Figure 7.3: The average (over 100 runs) of the number of suppressed decisions per
trial of Meliora-III-LEC for $p_{critic}$ ranging from 0.0 to 1.0.

In a second experiment, the task was changed after 250 trials so that the
robot was rewarded for picking up a red block (instead of a green one). Most
reinforcement learning algorithms (e.g., Q-learning) have a difficult time
adapting to such drastic changes since the previously learned action-value
function interferes with learning the new task by strongly biasing the agent’s
behavior away from the new source of reward. Moreover, the time needed to
“unlearn” (or change) the existing action-value function is substantial since
changes to action-values are incremental [Whitehead and Ballard, 1991b]. However,
as shown in Figure 7.4, Meliora-III-LEC has little trouble adapting to the change.
The incorrect existing policy does not significantly interfere with learning in
Meliora since quick suppression of inconsistent decisions makes “unlearning”
almost immediate. Moreover, the suppression rate at the point of change (trial
250) is lower than the initial suppression rate. This follows since most of the
useless decisions (i.e., those that leave the external state unchanged) have
already been detected as inconsistent and suppressed by trial 250.
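
To give a rough feel for why incremental "unlearning" is slow compared with suppression, the sketch below counts how many updates an action-value needs before it falls below a small threshold. It is a simplified illustration, not the thesis's experiment: it assumes every update pulls the value toward a fixed target of zero (ignoring the bootstrapped value of the successor state), and the function name and the threshold are hypothetical. The learning rate 0.2 and the initial value 1.0 are taken from the experiments described above.

```python
def updates_until_below(q0: float, alpha: float, threshold: float, target: float = 0.0) -> int:
    """Count incremental updates q <- q + alpha * (target - q) until q < threshold."""
    if not (0.0 < alpha <= 1.0):
        raise ValueError("alpha must lie in (0, 1]")
    if threshold <= target:
        raise ValueError("threshold must exceed target, or the loop never ends")
    q, n = q0, 0
    while q >= threshold:
        q += alpha * (target - q)  # one incremental step toward the target
        n += 1
    return n


# An obsolete action-value of 1.0 whose correct value is now 0.
incremental = updates_until_below(q0=1.0, alpha=0.2, threshold=0.01)
print("incremental updates needed:", incremental)
print("updates needed with suppression: 1 (the value is reset to 0.0 directly)")
```

Under these assumptions the gap grows as the learning rate shrinks, since the value decays by a factor of (1 - alpha) per update.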


7.3  Learning  By  Watching 

Another technique that can be used to improve the learning rate is to gain
additional experience by observing and interpreting the behavior of others. We
call this approach Learning-By-Watching (LBW).

The  conceptual  organization  of  an  agent  using  LBW  is  shown  in  Figure  7.5. 
The  agent  has  two  fundamental  modes:  a  performance  mode  and  a  watching 
mode.  In  the  performance  mode,  the  agent  acts  like  any  other  adaptive  system 
(i.e.,  it  observes  the  state,  chooses  and  executes  an  action,  observes  the  outcome, 
and  adapts  its  policy  accordingly).  In  the  watching  mode,  the  situation  is  similar 
except  the  agent  uses  the  behavior  of  another  agent  as  its  source  of  experience. 
Depending  upon  the  mode  of  operation,  different  sensing/behavior-interpretation 
hardware  is  used  to  generate  the  state-action-reward  sequences  used  for  learning. 

In our work with LBW we have assumed that the learner can correctly recognize
the state-action-reward triple of any agent it is observing. This sequence
is then used for learning just as if it were the agent’s own personal experience.
Figure 7.4: Performance of Meliora-III-LEC in response to a change in the
underlying decision problem (panels a and b, plotted against trial number). In
this experiment the robot receives a reward for picking up a green block for the
first 250 trials and a reward for picking up a red block for the last 250 trials.
Quick suppression of inconsistent decisions allows Meliora-III to adapt to task
changes faster than more incremental approaches.
Figure 7.5: The conceptual organization of an agent using LBW. In this
architecture, observations of other agents are used as an alternative source of
experience for the embedded learner.


Although this assumption is overly simplistic and ignores many important
issues, it is reasonable considering our objective: to explore the potential
benefits of integrating reinforcement learning with more powerful cooperative
mechanisms like “learning-by-watching.” The general issue of recognizing and
interpreting the behavior of others is fundamental to these cooperative
mechanisms, and some promising early work in this area has been reported
[Newtson et al., 1977; Tsuji et al., 1977; Thibadeau, 1986; Hirai and Sato, 1989;
Kautz, 1987; Kuniyoshi et al., 1990]. However, many important problems remain
unaddressed.

Depending upon the particular circumstances of the agent and the environment
in which it is embedded, a number of LBW algorithms can be devised and a range
of performance improvements attained. A learner capable of observing a group
of equally naive peers solving similar tasks can gain from its observations, but
not nearly as well as when it observes a skilled role model. Further, if a learner
“knows” that the agent it observes is skilled, it can exploit that knowledge to
make even better use of its observations. We have studied a range of LBW
algorithms under a variety of conditions in a simple simulated environment
[Whitehead and Ballard, 1991b]. The results of these experiments indicate that
LBW techniques are robust and effective for a wide range of learning settings.
With respect to Meliora and the block-stacking task, we have only studied LBW in
the context of a skilled role model that is known to perform optimally. An LBW
algorithm for exploiting this information and experimental results for it on the
GB-task are described below.

7.3.1  Meliora-III-LBW 

Meliora-III-LBW is a program that learns the GB-task with the help of LBW. The
program uses a version of the lion algorithm that has been modified to
accommodate observations of a skilled role model. In this case, the embedded
controller uses two streams of experience for learning: personal experiences and
observed experiences. Personal experiences are used for adaptive perception and
action much as in Meliora-II. Observed experiences are treated somewhat
differently. With respect to overt control, observed experiences are used to learn
a bias function $B$ over state-action pairs. This bias function is used to bias
overt control just like in LEC. With respect to state identification, observed
experiences are used to identify potentially inconsistent states more quickly.

There  are  two  modes  of  operation  in  Meliora-III-LBW:  a  performance  mode 
and  a  watching  mode.  The  control  algorithm  used  during  the  performance  mode 
is  shown  in  Figure  7.6.  This  algorithm  is  almost  identical  to  the  lion  algorithm 
used  by  Meliora-II  (Figure  6.1),  except  that  the  choices  of  the  lion  and  the  policy 
action  are  affected  by  the  bias  function.  The  control  algorithm  used  while  in 
the  watching  mode  is  shown  in  Figure  7.7.  In  this  mode,  each  control  cycle 
begins  by  performing  the  perceptual  function  to  collect  a  sequence  of  candidate 
Overt Cycle:

1) Execute Perceptual Cycle and generate $S_t$, a set of internal
representations for the current world state.

2) Let $d_t = (s_t, a_t)$ be the maximal internal decision:
$d_t = \arg\max_{(s,a) \in S_t \times A_O} [Q_I(s,a) + B(s,a)]$

3) Choose a state, the lion, to represent the current world state: $\mathit{lion} = s_t$.

4) Estimate the utility of the current world state, $s_t$: $V_E(s_t) \leftarrow V_I(\mathit{lion})$.

5) Execute Update-Overt-Q-Estimates based on $V_E(s_t)$, $r_{t-1}$, $\mathit{oact}_{t-1}$, and $\mathit{lion}_{t-1}$;
where $r_t$ is the reward received at time $t$, $\mathit{oact}_{t-1}$ is the last overt action executed,
and $\mathit{lion}_{t-1}$ is the internal state selected to represent the previous world state.

6) Choose the next overt action to execute:

With probability $p$ follow policy $f_O(\mathit{lion}) = a_t$,

Otherwise choose randomly: $\mathit{oact}_t \leftarrow \mathrm{Random}(A_O)$

7) Execute $\mathit{oact}_t$ to obtain a reward $r_t$, and a new world state.

8) Go to 1).

Update-Overt-Q-Estimates:

1) Estimate the error in the lion’s action-value:

$E_{\mathit{lion}} = (r_{t-1} + \gamma V_E(s_t)) - Q_I(\mathit{lion}_{t-1}, \mathit{oact}_{t-1})$.

2) Update the action-value of the lion:

If ($E_{\mathit{lion}} < 0$)

Then the lion is suspected of being inconsistent, so suppress it:
$Q_I(\mathit{lion}_{t-1}, \mathit{oact}_{t-1}) \leftarrow 0.0$

Else update it using the standard 1-step Q-learning rule:

$Q_I(\mathit{lion}_{t-1}, \mathit{oact}_{t-1}) \leftarrow Q_I(\mathit{lion}_{t-1}, \mathit{oact}_{t-1}) + \alpha E_{\mathit{lion}}$

3) Update non-lion internal states:

For each $s \in S_{t-1}$ and $s \neq \mathit{lion}_{t-1}$ do:

Let $E_s = r_{t-1} + \gamma V_E(s_t) - Q_I(s, \mathit{oact}_{t-1})$

If ($E_s < 0$)

Then $s$ is suspected of being inconsistent, so suppress it: $Q_I(s, \mathit{oact}_{t-1}) \leftarrow 0.0$

Else update it using the lion’s error:

$Q_I(s, \mathit{oact}_{t-1}) \leftarrow Q_I(s, \mathit{oact}_{t-1}) + \alpha' E_{\mathit{lion}}$, where $\alpha' < \alpha$.


Figure 7.6: An outline of the decision procedure used by Meliora-III-LBW in
the performance mode. This procedure is similar to the lion algorithm used by
Meliora-II, except the control policy is biased by observations made during the
watching mode. The Perceptual Cycle is not shown here since it is identical to
the one used in Meliora-II.
Overt Cycle:

1) Execute Perceptual Cycle and generate $S_t$, a set of internal
representations for the current world state.

2) Let $d_t = (s_t, a_t)$ be the maximal internal decision:
$d_t = \arg\max_{(s,a) \in S_t \times A_O} [Q_I(s,a) + B(s,a)]$

3) Choose a state, the lion, to represent the current world state: $\mathit{lion} = s_t$.

4) Estimate the utility of the current world state, $s_t$: $V_E(s_t) \leftarrow V_I(\mathit{lion})$.

5) Execute Update-Overt-Estimates based on $V_E(s_t)$, $r_{t-1}$, $a^o_{t-1}$, and $\mathit{lion}_{t-1}$;
where $r_t$ is the reward received at time $t$, $a^o_{t-1}$ is the last overt action observed,
and $\mathit{lion}_{t-1}$ is the internal state selected to represent the previous world state.

6) Observe the overt action performed by the role model, $a^o_t$.

7) Go to 1).

Update-Overt-Estimates:

1) Estimate the error in the lion’s action-value:

$E_{\mathit{lion}} = (r_{t-1} + \gamma V_E(s_t)) - Q_I(\mathit{lion}_{t-1}, \mathit{oact}_{t-1})$.

2) Update the action-value of the lion:

If ($E_{\mathit{lion}} < 0$)

then the lion is suspected of being inconsistent, so suppress it:

$Q_I(\mathit{lion}_{t-1}, \mathit{oact}_{t-1}) \leftarrow 0.0$

Else update it using the standard 1-step Q-learning rule:

$Q_I(\mathit{lion}_{t-1}, \mathit{oact}_{t-1}) \leftarrow Q_I(\mathit{lion}_{t-1}, \mathit{oact}_{t-1}) + \alpha E_{\mathit{lion}}$.

3) Update non-lion internal states:

For each $s \in S_{t-1}$ and $s \neq \mathit{lion}_{t-1}$ do:

Let $E_s = r_{t-1} + \gamma V_E(s_t) - Q_I(s, \mathit{oact}_{t-1})$

If ($E_s < 0$)

then $s$ is suspected of being inconsistent, so suppress it: $Q_I(s, \mathit{oact}_{t-1}) \leftarrow 0.0$

Else update it using the lion’s error:

$Q_I(s, \mathit{oact}_{t-1}) \leftarrow Q_I(s, \mathit{oact}_{t-1}) + \alpha' E_{\mathit{lion}}$, where $\alpha' < \alpha$.

4) Also use inconsistencies in the role-model’s behavior to suppress inconsistent decisions:

For each $(s,a) \in S_{t-1} \times A_O$ do:

If ($a = a^o_{t-1}$ and $B(s,a) < 0$) or ($a \neq a^o_{t-1}$ and $B(s,a) > 0$)

then $(s,a)$ is suspected of being inconsistent, so suppress it: $Q_I(s,a) \leftarrow 0.0$

5) Update the bias function, $B$:

for all $(s,a) \in S_{t-1} \times A_O$ do: $B(s,a) \leftarrow (1 - \psi) B(s,a) + \psi\, r_o(s,a)$.


Figure 7.7: An outline of the decision procedure used by Meliora-III-LBW in the
watching mode. This procedure generates a set of candidate internal states to
represent the current situation facing the role model, observes the role model’s
action, and updates the action-value function and a bias function based on the
results. An internal state is considered inconsistent if the role model is observed
to perform more than one action for that state.
representations for the current situation, $S_t$. From these candidates, a single
lion is identified to represent the current state. The perceptual cycle is not
shown in the figure since it is identical to the one used by Meliora-II. Next,
instead of selecting an action to execute, the robot simply observes the action
executed by the external role model, $a^o_t$, and the reward that results,
$r^o_t$.² At this point, the robot has made an observation, $O_t$, which consists
of the following information: $S_t$, a set of candidate internal states; $a^o_t$,
the overt action command executed by the role model; and $r^o_t$, the reward
received by the role model as a result of its action. This information is used to
update the action-value and bias functions as follows.

The bias function is updated by increasing the bias of state-action pairs that
match the role model’s behavior and by decreasing the bias of those decisions that
do not. At time $t$, the bias function is updated as follows:

For all $(s, a) \in S_{t-1} \times A_O$ do:

$$B(s,a) \leftarrow (1 - \psi) B(s,a) + \psi\, r_o(s,a),$$

where

$$r_o(s,a) = \begin{cases} R_o & \text{if } a = a^o_{t-1} \\ -R_o & \text{otherwise} \end{cases}$$

and where $R_o$ is a positive constant and $\psi$ is a fixed learning rate parameter
between 0 and 1.
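
A minimal Python sketch of this update follows. It is only an illustration: the dictionary representation of $B$ and the name `update_bias` are assumptions made for the example, and the default parameter values (a learning rate of 0.6 and a bias reward of 500) are borrowed from the values reported for the experiments in this chapter.

```python
from typing import Dict, Hashable, Iterable, Tuple

Decision = Tuple[Hashable, Hashable]  # (internal state, overt action)


def update_bias(
    bias: Dict[Decision, float],
    candidates: Iterable[Hashable],
    actions: Iterable[Hashable],
    observed_action: Hashable,
    psi: float = 0.6,
    r_o: float = 500.0,
) -> Dict[Decision, float]:
    """Apply B(s,a) <- (1 - psi) * B(s,a) + psi * r_o(s,a) to every candidate pair.

    r_o(s,a) is +r_o when `a` matches the role model's observed action and
    -r_o otherwise. Pairs not yet in the table start from a bias of zero.
    """
    actions = list(actions)
    for s in candidates:
        for a in actions:
            target = r_o if a == observed_action else -r_o
            bias[(s, a)] = (1.0 - psi) * bias.get((s, a), 0.0) + psi * target
    return bias


# The role model is observed to choose "left" twice in a situation represented by "s1".
table: Dict[Decision, float] = {}
update_bias(table, ["s1"], ["left", "right"], "left")
update_bias(table, ["s1"], ["left", "right"], "left")
print(table)
```

With repeated consistent observations the bias of the matching pair approaches +r_o and that of every other pair approaches -r_o, which is what lets the sign of the bias act as a compact memory of past observations.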


The overt action-value function is updated using the same technique as in the
lion algorithm. That is, the lion is updated using the 1-step Q-learning rule and
non-lions are updated based on the error in the lion’s estimate. As before,
decisions that are suspected of being inconsistent have their action-values
suppressed. In Meliora-III-LBW, inconsistent decisions are detected by using the
overestimation technique and by detecting variations in the role model’s choice of
action for a given internal state. In particular, if for a given internal state,
the role model is observed to perform two different actions, then the internal
state is assumed to be inconsistent.³ The bias function is used to detect
variations within internal states as follows:


For all $(s, a) \in S_t \times A_O$ do:

the decision $(s, a)$ is suspected of being inconsistent and suppressed if:

1) $a = a^o_t$ and $B(s,a) < 0$
or
2) $a \neq a^o_t$ and $B(s,a) > 0$.

Both of these cases correspond to situations where the learner has previously
observed the role model perform a different action in a situation represented by
$s$. In the first case, the negative bias records that the role model previously
chose some action other than $a$; in the second, the positive bias records that it
previously chose $a$.

²The details of how this recognition is performed are non-trivial and may be very
difficult. However, in our experiments these difficulties are ignored.

³This method for detecting inconsistent states assumes that the role model always
performs the same optimal action for a given state, even when more than one exists.
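
The sketch below traces this detection on a tiny example, following the test given in step 4 of Figure 7.7 (a matching action with negative bias, or a non-matching action with positive bias) and then applying the bias update of step 5. It is a hedged illustration only: the function `watch_step`, the dictionary tables, and the action names are hypothetical, and the perceptual cycle and Q-learning updates are omitted.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Decision = Tuple[Hashable, Hashable]  # (internal state, overt action)


def watch_step(
    bias: Dict[Decision, float],
    q: Dict[Decision, float],
    candidates: Sequence[Hashable],
    actions: Sequence[Hashable],
    observed_action: Hashable,
    psi: float = 0.6,
    r_o: float = 500.0,
) -> List[Decision]:
    """Suppress decisions contradicted by the role model, then update the bias."""
    suppressed = []
    for s in candidates:
        for a in actions:
            b = bias.get((s, a), 0.0)
            matches = a == observed_action
            # The sign of the bias disagrees with what the role model just did.
            if (matches and b < 0) or (not matches and b > 0):
                q[(s, a)] = 0.0
                suppressed.append((s, a))
    for s in candidates:
        for a in actions:
            target = r_o if a == observed_action else -r_o
            bias[(s, a)] = (1.0 - psi) * bias.get((s, a), 0.0) + psi * target
    return suppressed


# Internal state "s" is aliased: the role model acts "left" twice, then "right".
bias_table: Dict[Decision, float] = {}
q_table = {("s", "left"): 1.0, ("s", "right"): 1.0}
trace = [watch_step(bias_table, q_table, ["s"], ["left", "right"], a)
         for a in ("left", "left", "right")]
print(trace)
```

Nothing is suppressed while the role model repeats the same action; the first deviation flags every decision for that internal state, consistent with the assumption that a skilled role model never varies its action within a consistent state.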


7.3.2  Experiments 

To evaluate the utility of LBW for state identification and overt control,
Meliora-III-LBW was tested on the GB-task. In these experiments the robot
alternates between solving the task itself and watching an external role model
perform the task. As before, experimental runs consist of 250 trials each. In
these experiments the following parameter values were used: $\alpha = 0.2$,
$\gamma = 0.8$, $\psi = 0.6$, $p = 0.9$,
p'  =  0.9,  Ro  =  500,  and  =  100.  Results  are  shown  in  Figure  7.8,  which  shows 
the  average  number  of  steps  per  trial  and  the  average  suppression  rate  per  trial 
for  Meliora-III-LBW  and  Meliora  II.  The  average  solution  time  curve  for  Meliora- 
III-LBW  shows  only  trials  attempted  by  the  robot.  Both  plots  clearly  indicate 
that  experiences  gained  by  observing  a  skilled  role  model  substantially  improve 
overall  performance. 


7.4  Analysis 

This section presents a more formal analysis of search in the unbiased Q-learning,
LEC, and LBW algorithms. Naturally the scaling properties of any reinforcement
learning algorithm strongly depend upon the structure of the decision problem and
the details of the algorithm itself. To date, we have been unable to analyze the
learning time complexity of Q-learning, LEC, or LBW in general. However, results
have been obtained for specific algorithms on a restricted class of deterministic
decision problems. In particular, it is shown that for a rather restricted (but
representative) class of generic decision problems, called homogeneous problem
solving tasks, the learning time of a zero-initialized Q-learning system scales at
least exponentially in the depth of the state space.⁴ LEC and LBW algorithms
are shown to have substantially better learning time complexities. In particular,
an LEC algorithm is shown to have an expected learning time that is no worse
than linear in the state space size, and under appropriate conditions linear in the
length of the optimal solution path (and independent of the state space size).
Analogous results are obtained for LBW algorithms.

⁴A zero-initialized Q-learning system is a system whose initial action-values are
uniformly zero. Also, as will be defined below, the depth of a state space
corresponds roughly to the maximum number of steps between two states in a state
space.
Figure 7.8: Plots of a) the average solution time and b) the average number of
suppressions per trial for Meliora-III-LBW and Meliora-II.
The analysis begins by defining a number of properties that are useful for
characterizing state spaces. These properties are then used to define the class
of homogeneous decision problems. Next, a lower bound on the expected learning
time for a zero-initialized Q-learning system is derived, and its relevance to
Q-learning in general is discussed. Following that, results for LEC and LBW
algorithms are presented and discussed. For the most part, the proofs for the
theorems that follow are straightforward. They have been omitted for clarity of
presentation, but can be found in Appendix A.

7.4.1  Definitions 

Let  us  begin  by  defining  a  number  of  properties  that  are  needed  to  define  the  class 
of  homogeneous  problem  solving  tasks. 

Definition 1 (deterministic decision problem) A decision problem is deterministic
if it can be described by a Markov decision process whose transition and
reward functions are true functions (i.e., deterministic).

Definition 2 (1-step invertible) A deterministic decision problem is 1-step
invertible if every action has an inverse. That is, if in state $x$, action $a$
causes the system to enter state $y$, there exists an action $a^{-1}$ that when
executed in state $y$ causes the system to enter state $x$.

Definition 3 (uniformly $k$-bounded) A state space is uniformly $k$-bounded
with respect to the state $x$ if

1. The maximum number of steps needed to reach $x$ from anywhere in the state
space is $k$.

2. All states whose distance to $x$ is less than $k$ have $b_-$ actions that
decrease the distance to $x$ by one, $b_+$ actions that increase the distance to
$x$ by one, and $b_=$ actions that leave the distance to $x$ unchanged.

3. All states whose distance to $x$ is $k$ have $b_-$ actions that decrease the
distance by one and $b_= + b_+$ actions that leave the distance unchanged.⁵

Definition 4 (homogeneous) A state space is homogeneous with respect to the
state $x$ if it is 1-step invertible and uniformly $k$-bounded with respect to
$x$. In this case, $k$ is said to be the depth of the state space.

⁵That is, at the boundaries, actions that would normally increase the distance to
$x$ are folded into actions that leave the distance unchanged.
Definition 5 (polynomial width) A homogeneous state space (of depth $k$) has
polynomial width if the size of the state space is a polynomial function of its
depth.

Definition 6 (problem solving task) A problem solving task is a sequential
decision problem in which

1. each trial begins in a designated start state $S$,

2. each trial ends when the system reaches a designated goal state $G$, or gives
up, and

3. the system receives a non-zero, positive reward only upon entering the goal
state. That is,

$$r_t = \begin{cases} 1 & \text{if } s_{t+1} = G \\ 0 & \text{otherwise.} \end{cases}$$

Definition  7  (homogeneous  problem  solving  task)  A  task  is  a  homogeneous 
problem  solving  task  if  it  is  a  problem  solving  task  and  its  associated  state  space 
is  homogeneous  with  respect  to  the  goal  state  G. 

Homogeneous problem solving tasks represent an idealization of decision problems
commonly studied in reinforcement learning. For instance, properties such
as locality and invertibility of actions, and delayed, sparse rewards are common
to many sequential decision problems. Other assumptions, such as the uniformity
of the state space, the deterministic effects of actions, and the use of single
initial and final states are somewhat restrictive; however, they simplify the
analysis. Some of these restrictions are not strictly necessary, but simplify
exposition. For instance, many of the results described below also apply to
stochastic processes, so long as transitions are local. Other assumptions, such as
the uniformity of the state space, though unrealistic, probably do not
fundamentally affect the complexity of a given algorithm, but are needed to obtain
closed form analytical expressions. For instance, the scaling properties of an
algorithm when applied to non-uniform tasks are likely to match the scaling
properties of the algorithm when applied to an analogous class of uniform tasks.
Also, uniformity of the state space provides a convenient means for clearly
defining what it means to scale a task, and allows us to carefully model and
evaluate the effect of structural bias in the state space (e.g., by manipulating
the values for $b_-$, $b_=$, and $b_+$).

7.4.2  Unbiased  Random  Walks  and  Q-learning 

In general, it would be desirable to have bounds on the expected time needed to
solve a problem initially. Such bounds would be useful since much of the time
spent learning is accounted for in the first few trials when the agent thrashes
about in search of feedback (reward). Unfortunately, even for homogeneous tasks,
this search time cannot be determined without knowledge of the learner’s initial
parameter values. For instance, proper initialization of a Q-learner’s action-value
function will yield optimal performance from the outset. If the initial parameter
values do not encode any information about the task, then the learner is said
to be initially unbiased. In Q-learning, initially unbiased agents can be obtained
by using constant initial action-values (i.e.,
$\forall (s,a) \in S \times A,\; Q(s,a) = C$). Even under these circumstances the
initial performance of the learner is difficult to quantify analytically. A
special case that is analytically tractable occurs when the initial action-value
function is uniformly zero.

Definition 8 (zero-initialized) A Q-learning system with an action-value function
whose initial values are uniformly zero is said to be zero-initialized.

In the case of zero-initialized Q-learning, the agent performs an unbiased random
walk over the state space until it first encounters a non-zero reward. Here we
assume standard semantics for Q-learning, namely that the agent when following
policy selects the action with the largest action-value and breaks ties by randomly
selecting one of the maximal actions. In a zero-initialized system, all actions
initially appear equally good (or bad) since they all share the same action-value.
Moreover, the estimation error, obtained after each step and used to update the
action-value function, is zero until the agent receives a non-zero reward. Thus, in
a problem solving task, a zero-initialized Q-learner solves its first task by
performing a random walk.⁶ For homogeneous state spaces, a closed form expression
can be obtained for the expected duration of this random walk.


Theorem 3 In a homogeneous problem solving task, the expected time needed by
a zero-initialized Q-learning system to perform the random walk needed to solve
the first trial is given by the expression


⁶Indeed, since the reward received at the end of the task may only be used to
update a few states (exactly 1 in the case of 1-step Q-learning), the agent will
perform a series of shorter and shorter random walks, one for each trial until
reward information gets propagated to more distal states.
and

$$P_= = \frac{b_=}{b_= + b_+ + b_-},$$

and where $i$ is the distance between $S$ and $G$, and $k$ is the depth of the
state space (with respect to $G$).

Figure 7.9: Search time complexity as a function of state space depth.

Corollary 4 For state spaces of fixed width and for $P_+ > 1/2$, the expected
search time is exponential in the state space size.

Corollary 5 For state spaces of polynomial width and for $P_+ > 1/2$, the expected
search time is moderately exponential in the state space size.

In Theorem 3, $P_=$ is the probability that the system, when choosing actions
randomly, selects an action that leaves the distance to the goal unchanged, and
$P_+$ (and $P_-$) is the conditional probability that the system chooses an action
that increases (decreases) the distance to the goal, given that it chooses one that
changes the distance. These transition probabilities capture the structural bias
inherent in the underlying state space. That is, for some problems the inherent
structure of the task is such that it tends to funnel the agent toward the goal. In
other cases, the structure of the state space may negatively bias the random walk
and substantially increase its expected duration.

Figure 7.9 shows a series of plots of expected solution time (Equation 7.6)
versus state space depth $k$ for $i = 10$, and $P_+ \in [0.45, 0.55]$. When
$P_+ > 1/2$, the expected solution time scales exponentially in $k$, where the
base of the exponent is the ratio $P_+/P_-$. When $P_+ = 1/2$, the solution time
scales linearly in $k$, and when $P_+ < 1/2$ it scales sublinearly.
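
These scaling regimes can be checked numerically without the closed form. The sketch below computes the expected number of steps for the random walk to first reach the goal by solving the standard first-step recurrence for a walk on the distances 0 to k, with the boundary behaviour of Definition 3 (at distance k, actions that would increase the distance leave it unchanged). It is an independent illustration written for this purpose, not a transcription of Equation 7.6; the function name and its arguments are hypothetical, `p_plus` stands for the conditional probability of moving away from the goal, and `p_same` for the probability of leaving the distance unchanged.

```python
def expected_first_trial_time(i: int, k: int, p_plus: float, p_same: float = 0.0) -> float:
    """Expected steps for a random walk starting at distance i to reach distance 0.

    Interior distances move away from the goal with probability
    up = (1 - p_same) * p_plus and toward it with probability
    down = (1 - p_same) * (1 - p_plus); at distance k the "up" moves stay put.
    """
    if not (0 <= i <= k):
        raise ValueError("need 0 <= i <= k")
    if not (0.0 < p_plus < 1.0) or not (0.0 <= p_same < 1.0):
        raise ValueError("need 0 < p_plus < 1 and 0 <= p_same < 1")
    up = (1.0 - p_same) * p_plus
    down = (1.0 - p_same) * (1.0 - p_plus)
    # gap[d] = E[d] - E[d-1]. At the boundary, gap[k] = 1 / down;
    # in the interior, gap[d] = (1 + up * gap[d+1]) / down.
    gap = 1.0 / down
    gaps = {k: gap}
    for d in range(k - 1, 0, -1):
        gap = (1.0 + up * gap) / down
        gaps[d] = gap
    # E[0] = 0, so E[i] is the sum of the first i gaps.
    return sum(gaps[d] for d in range(1, i + 1))


for p_plus in (0.45, 0.50, 0.55):
    times = [expected_first_trial_time(10, k, p_plus) for k in (10, 20, 40)]
    print(p_plus, [round(t, 1) for t in times])
```

Under these assumptions the printed rows show the three regimes: the times level off for `p_plus` below 1/2, grow linearly in k at exactly 1/2, and blow up geometrically above 1/2.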

The case where $P_+ > 1/2$ (negative biasing) is important for two reasons.
First, for many interesting problems it is likely that $P_+ > 1/2$. For example,
if a robot attempts to build an engine by randomly fitting parts together, it is
much more likely to take actions that are useless or move the system further from
the goal than towards it. This follows since engine assembly requires a fairly
sequential ordering. Similarly, a child can be expected to take time exponential in
the number of available building blocks to build a specific object when combining
them at random. Of course, the state spaces for building engines and assembling
blocks are not homogeneous, but the negative bias inherent in their state spaces
is likely to have similar exponential effects on agents that initially solve tasks
using random walks.

Second, when P+ is only slightly greater than 1/2, it doesn’t take long before
the exponent leads to unacceptably long searches. Figure 7.9 illustrates this point
dramatically; even when P+ is as small as 0.51 the solution time diverges quickly.
When P+ = 0.55 (i.e., the system is only 10 percentage points more likely to take
a “bad” action than a “good” one), the search time diverges almost immediately.

Theorem  3  applies  only  to  zero-initialized  Q-learning  systems  on  homogeneous 
problem  solving  tasks.  When  the  task  is  non-homogeneous,  the  analysis  breaks 
down  because  the  expected  time  needed  to  perform  the  random  walk  is  difficult 
to  analyze.  When  the  initial  action-values  are  non-zero  the  analysis  breaks  down 
because  the  agent’s  behavior  is  no  longer  completely  random.  For  instance,  ini¬ 
tializing  the  action-values  to  a  fixed  positive  constant  yields  a  search  that  is  biased 
towards  exploring  previously  untried  actions.  This  follows  since  in  this  case  each 
time  the  action-value  of  a  state-action  pair  is  updated  its  value  is  reduced.  The 
more  a  state-action  pair  is  executed  the  lower  its  action-value  becomes.  This,  in 
turn,  favors  the  selection  of  actions  that  have  been  tried  less  often  in  the  past, 
resulting  in  more  exploration  than  in  a  pure  random  walk.  Conversely,  initializing 
the action-values to a fixed negative constant yields a search that avoids
exploration and prefers to repeatedly try previously used actions. This follows since in
this case each time a state-action pair is applied its action-value is increased
incrementally toward zero. The more a decision is selected, the greater its action-value
becomes  and  the  more  likely  it  is  to  be  selected  in  the  future. 
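The direction of these biases can be checked on a single update. The sketch below assumes the standard 1-step Q-learning rule with zero reward (as in every step before the goal is first reached) and a successor state whose action-values are still at their initial value; the learning rate and discount factor are illustrative assumptions.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.5, gamma=0.9):
    """One 1-step Q-learning update for a single state-action pair."""
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)


if __name__ == "__main__":
    for q_init in (-1.0, 0.0, 1.0):
        # Zero reward; the successor's best action-value is still q_init.
        print(q_init, "->", q_update(q_init, 0.0, q_init))
```

With a discount factor below one, a positive initial value moves down after each unrewarded use (so untried actions look better), a negative one moves up toward zero (so tried actions look better), and a zero value does not change.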

Theorem 3 states that for P+ > 1/2, a zero-initialized Q-learner can be
expected to take time exponential in the depth of the state space for the initial
solution  in  a  homogeneous  problem  solving  task.  It  is  useful  to  determine  if  sim¬ 
ilar  complexity  results  hold  for  other  initial  action-values.  Since  these  cases  are 
difficult  to  analyze  formally,  this  question  is  addressed  empirically.  In  particular, 
three separate 1-step Q-learning agents were applied to a homogeneous problem
solving task whose state space was systematically scaled in depth. The three
agents, named Q-, Q0, and Q+, had action-value functions that were uniformly
initialized to -1.0, 0.0, and 1.0, respectively. The decision task is shown in
Figure 7.10. The goal state, G, is on the far left. The start state, S, is maintained
at a constant distance of 5 steps from the goal as the state space is scaled in
depth, k. Because there is a 3:1 ratio of actions that increase the distance to the
goal to actions that decrease the distance to the goal, the state space is biased
against the goal state. The parameters that characterize the state space in terms
of Theorem 3 are: b+ = 3, b- = 1, b= = 2. The depth of the state space was
systematically scaled from k = 5 to k = 100. For each depth, the expected
number of steps needed to solve the first trial was estimated by averaging the results
of 200 runs.^ The average first solution times for Q0 and Q+ are plotted versus
depth in Figure 7.11a. Q- was unable to solve the task in a reasonable amount
of time under any circumstances since it tended to get stuck in local cycles that
never made progress toward the goal. The figure shows that Q+, aided by its
exploratory bias, significantly outperforms Q0. However, Q+ continues to scale
super-linearly in the depth of the state space. To determine whether the search
time scales exponentially in depth, the logarithm of the average solution time
is plotted versus depth in Figure 7.11b. This plot shows that, as expected, Q0
scales exponentially in depth, but that Q+ scales sub-exponentially. Apparently,
the slight exploratory bias caused by the use of positive initial action-values is
sufficient to reduce the complexity of the initial search, even for problem spaces
that are intrinsically biased away from the goal. Nevertheless, for practical
purposes, the search time continues to lead to intractably long learning times since
it appears to be at best polynomial in the depth of the state space.
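An experiment of this kind can be sketched in a few lines. The code below is an illustrative reconstruction, not the implementation used for Figure 7.11: it assumes six actions per state (three that increase the distance to the goal, one that decreases it, two that leave it unchanged), greedy action selection with random tie-breaking, a reward of 1 on reaching the goal, and arbitrary values for the learning rate, discount factor, and step cap.

```python
import random


def first_solution_steps(k, q_init, start=5, alpha=0.5, gamma=0.9,
                         max_steps=20000, seed=0):
    """Steps until a greedy 1-step Q-learner first reaches the goal.

    States are distances 0..k from the goal (requires k >= start). Returns
    max_steps if the goal is not reached within the cap.
    """
    rng = random.Random(seed)
    effects = [+1, +1, +1, -1, 0, 0]      # assumed action effects on distance
    q = [[q_init] * len(effects) for _ in range(k + 1)]
    d = start
    for step in range(1, max_steps + 1):
        best = max(q[d])
        action = rng.choice([a for a, v in enumerate(q[d]) if v == best])
        nxt = min(k, max(0, d + effects[action]))
        # The goal is terminal: reward 1 and no bootstrapped value.
        target = 1.0 if nxt == 0 else gamma * max(q[nxt])
        q[d][action] += alpha * (target - q[d][action])
        d = nxt
        if d == 0:
            return step
    return max_steps


if __name__ == "__main__":
    for name, q_init in (("Q-", -1.0), ("Q0", 0.0), ("Q+", 1.0)):
        runs = [first_solution_steps(8, q_init, seed=s) for s in range(20)]
        print(name, sum(runs) / len(runs))
```

Averaging over many seeds and a range of depths k gives curves that can be compared qualitatively with the reported behavior; the exact numbers depend on the assumed parameters.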

7.4.3 Analysis of LEC

The principal idea of both LEC and LBW is to reduce initial search by exploiting
information gained from external agents. LEC algorithms depend on immediate
feedback  from  a  knowledgeable  external  critic  to  reduce  feedback  latency,  while 
LBW  algorithms  depend  on  other  agents  for  alternative  (often  highly  biased) 
sources  of  experience. 

The results presented in this subsection focus on LEC algorithms. In particular,
it is shown that in a homogeneous state space the BB-LEC algorithm has an
expected initial search time that is at most linear in the size of the state space.
This upper bound is an improvement over unbiased Q-learning, but still
disappointing because of its dependence on state space size. However, tighter upper
bounds can be obtained either by restricting the class of state spaces further or
by modifying the capabilities of the learner. Under these circumstances bounds
on the expected search time can be obtained that 1) depend only on the length
of the optimal solution path and 2) are independent of the state space size.

^After each run the agent’s action-value function was reset to its initial value.

Figure 7.10: The 1-dimensional homogeneous problem solving task used to assess
the effect of the initial action-value function on initial search. The start state
is maintained at a constant distance from the goal, G, as the depth of the state
space is scaled from k = 5 to k = 100.

Theorem 6 The expected time needed by a zero-initialized BB-LEC system to
learn the actions along an optimal path for a homogeneous problem solving task
of depth k is bounded above by

    (k / P_critic) · |S| · b        (7.7)

where P_critic is the probability that on a given step the external critic provides
feedback, |S| is the total number of states in the state space, and b is the branching
factor (or total number of possible actions per state).

This upper bound is somewhat disappointing because it is expressed in terms
of the state space size, |S|, and the maximum depth, k. Our goal is to find
algorithms that depend only upon task difficulty (i.e., length of optimal solution)
and are independent of state space size and depth. Nevertheless, the result is
interesting for two reasons. First, it shows that when P+ > 1/2, BB-LEC is
an improvement over unbiased Q-learning since the expected search time grows
at most linearly in k, whereas Q-learning grows exponentially for zero-initialized
systems and apparently polynomially for systems with fixed positive initial
action-values. Second, because this upper bound is inversely proportional to P_critic, the
theorem shows that even infrequent feedback from the critic is sufficient to achieve
the linear upper bound. This effect was observed in Meliora-III-LEC, where even
infrequent feedback from the critic substantially improved performance.

Figure 7.11: Average solution time plots for Q0 and Q+. a) The average number
of steps taken in the initial solution to the task for Q0 and Q+ versus depth.
The exploratory bias afforded Q+ by its initial positive action-values leads to
performance that is significantly better than Q0. b) The natural logarithm of the
average first solution time versus depth for Q0 and Q+. The graph confirms the
exponential scaling rate for Q0 and shows Q+ to scale subexponentially in depth.

The  trouble  with  the  BB-LEC  algorithm,  as  described  so  far,  is  that  the  critic's 
feedback  arrives  late.  That  is,  by  the  time  the  learner  receives  the  critic’s  evalua¬ 
tion  it  finds  itself  in  another  (neighboring)  state,  where  the  feedback  is  of  no  value. 
If  the  learner  has  an  efficient  means  of  returning  to  previously  encountered  states, 
it  can  make  better  use  of  the  critic’s  feedback.  This  idea  leads  to  the  following 
results,  which  show  that  under  appropriate  conditions  the  search  time  depends 
only  upon  the  solution  length  and  is  independent  of  state  space  size. 


Theorem 7 If a zero-initialized BB-LEC system uses an inverse model* to
“undo” non-optimal actions (as detected based on feedback from the external critic),
then the expected time needed to learn the actions along an optimal path for a
homogeneous problem solving task is linear in the solution length i, independent of
state space size, and bounded above by the expression

    […] · i        (7.8)


Similarly,  if  the  task  is  structured  so  that  the  system  can  give  up  on  a  trial  that 
it  fails  to  solve  after  a  reasonable  amount  of  time  or  if  the  system  is  continually 
presented  with  opportunities  to  solve  new  instances  of  a  problem,  then  previously 
encountered  situations  can  be  revisited  without  much  delay  and  the  search  time 
can  be  reduced. 


Theorem 8 A zero-initialized BB-LEC system that aborts a trial and starts anew
if it fails to solve the task after n_q (n_q > i) steps has, for a homogeneous problem
solving task, an expected initial solution time that is linear in i, independent of
state space size, and bounded above by the expression

    (1 / […]) · n_q · i        (7.9)


Corollary 9 A zero-initialized Q-learning system using BB-LEC that quits a trial
and  starts  anew  upon  receiving  negative  feedback  from  the  external  critic  has  an 
expected  solution  time  that  is  bounded  from  above  by  the  expression 

[p^li-p,)]  *  (f=  +  (1 -/■=)/> J  *  ’■ 

*This theorem does not account for the time needed to learn the inverse model. It assumes
the inverse model is known a priori.

The crucial assumption underlying these results is that the learner has some
mechanism for quickly returning to the site of feedback; however, for some tasks
an explicit mechanism may not be necessary to decouple search time from the
state space size. In particular, if the optimal decision surface is smooth (i.e.,
optimal actions for neighboring states are similar), then action-value and bias
functions implemented using approximation techniques that locally interpolate
(e.g., CMACs or neural nets) can immediately use the critic’s feedback to bias
the impending decision. Alternatively, if Q and B are approximated using
techniques that generalize non-locally (e.g., classifier systems [Holland et al., 1986]),
then the critic’s feedback can be expected to transfer to other non-local situations
as well. Although not reflected in the above theorems, generalization techniques
like these probably will enable LEC to be useful even when explicit mechanisms
for inversion are not available.


7.4.4  Analysis  of  LBW 

LEC algorithms are sensitive to naive critics. That is, if the critic provides poor
feedback, the learner will bias its policy incorrectly. This limits the use of LEC
algorithms to cases in which the external critic is skilled and attentive. Learning
By Watching, on the other hand, does not necessarily rely on a skilled, attentive
critic.  Instead,  the  learner  gains  additional  experience  by  interpreting  the  behavior 
of  others.  If  the  observed  behavior  is  skilled  so  much  the  better,  but  an  LBW 
system  can  learn  from  naive  behavior  too. 


Theorem  10  For  a  population  of  naive  (zero-initialized)  Q-learning  agents  using 
LBW,  the  expected  time  to  learn  the  actions  along  an  optimal  path  decreases  to 
the minimum required learning time at a rate that is Ω(1/n), where n is the size
of  the  population. 


Without  help  by  other  means,  a  population  of  naive  LBW  agents  may  still 
require  time  exponential  in  the  state  space  depth.  However,  search  time  can  be 
decoupled  from  state  space  size  by  adding  a  knowledgeable  role  model. 


Theorem  11  If  a  naive  agent  using  LBW  and  a  skilled  (optimal)  role  model  solve 
identical  tasks  in  parallel  and  if  the  naive  agent  quits  its  current  task  after  failing 
to solve it in n_q steps, then an upper bound on the time needed by the naive agent
to first solve the task (and learn the actions along the optimal path) is given by

    n_q + i.        (7.11)

As with the LEC results, Theorem 11 relies on the agent having a mechanism
for returning to previously encountered states. Intuitively this result follows since
when a naive agent and a skilled agent perform similar tasks in parallel, it is
possible for the naive agent to move off the optimal solution path and find itself
in parts of the state space that are never visited by the skilled agent. Starting
over is a means for efficiently returning to the optimal solution path. Again, we
expect LBW systems to perform well on tasks that have decision surfaces that are
smooth  or  that  otherwise  lend  themselves  to  function  approximation  techniques 
that generalize.

8 Limitations and Future Work


This  chapter  discusses  some  of  the  limitations  of  the  algorithms  described  so 
far  and  points  out  a  number  of  difficulties  that  may  arise  when  applying  these 
techniques  to  more  complex  and  realistic  control  problems.  To  overcome  these 
difficulties  extensions  to  the  previously  described  algorithms  are  proposed.  While 
these  extensions  seem  plausible,  they  are  necessarily  speculative.  To  date  none 
has been implemented or tested. Approaches that fall outside of the CR-method
framework  are  also  considered. 


8.1 Separation of Perceptual and Overt Control

The fundamental assumption of the CR-method is that at each point in time
control is achieved by first identifying the external state (state identification) and
then executing the appropriate overt action (overt control). To ensure that the
external state does not change, only perceptual actions are allowed to be executed
during state identification. This restriction has two important consequences: 1)
it constrains the control strategy learned by the agent to one that separates
perceptual and overt control, which may be suboptimal; 2) it limits the class
of decision problems that can be solved at all using the CR-method.

8.1.1  Non-Optimal  Control 

Figure 8.1: A task that CR-methods cannot learn to perform optimally. a) The
external task: on half the trials G1 is in the upper cell and G2 is in the lower; on
the other trials the rewards are reversed. An arrow on the ceiling of the junction
cell indicates the direction to G1, the larger reward. b) The internal decision
problem: the inconsistency in state J can be resolved by performing the “look”
perceptual action. However, depending on the difference between G1 and G2 and
the cost of the “look” action, the optimal action in state J may be to perform the
“up” overt action directly.

The CR-method restricts the embedded decision system’s control strategy to one
that performs overt actions only from consistent internal states. While this
approach may be able to optimize the expected return with respect to this restricted
class of control strategies (i.e., strategies that first identify and then control the
external state), such systems cannot learn optimal control policies for internal
decision problems that require the execution of overt actions from inconsistent
internal states. An example of such a decision problem is illustrated in Figure 8.1.
The  figure  shows  a  task  in  which  a  control  policy  that  performs  an  overt  action 
from  an  inconsistent  internal  state  outperforms  the  best  policy  that  executes  only 
perceptual  actions  from  inconsistent  internal  states.  The  external  part  of  the 
task  involves  navigating  from  a  start  state  to  one  of  two  possible  reward  sites 
(Figure  8.1a).  Trials  begin  in  state  S  and  end  when  the  agent  receives  a  non-zero 
reward (either G1 or G2). On half of the trials, the agent receives reward G1 upon
entering the upper cell, and G2 upon entering the lower cell. On the other half of
the trials, the rewards are reversed. At the junction cell there is an arrow on the
ceiling, which indicates the direction to G1, the larger reward. This information
is  not  automatically  registered  by  the  agent  but  can  be  obtained  by  performing 
the  perceptual  action  “look”  at  the  junction.  The  corresponding  internal  decision 
problem  is  shown  in  Figure  8.1b.  The  internal  state  J  is  inconsistent.  It  repre¬ 
sents  the  situation  where  the  agent  is  at  the  junction  in  the  external  task  but  is 
not  looking  at  the  arrow.  In  this  state,  the  overt  actions  “up”  or  “down”  have 
inconsistent  effects  that  depend  upon  the  direction  of  the  arrow.  By  performing 
the “look” action, the agent can resolve the ambiguity, reach a consistent
internal state (either J↑ or J↓), and then perform an overt action that maximizes
the expected return. Depending on the difference between G1 and G2 and the
cost of executing the perceptual action “look,” a policy that executes the overt
“up” action in state J may on average outperform policies that always resolve
the inconsistency. For instance, if the difference between G1 and G2 is less than
the cost of the “look” action, it is better on average to perform an overt action
directly from state J. Unfortunately, systems based on the CR-method are
constrained to control strategies that always resolve inconsistencies, even if the cost
of  resolution  outweighs  the  benefit.  As  long  as  the  decision  system  is  constrained 
to  execute  overt  actions  only  from  consistent  internal  states,  there  will  be  tasks 
like  this  one  that  the  agent  cannot  perform  optimally.  The  CR-method  does  not 
provide  a  mechanism  for  trading  off  the  cost  of  perception  against  the  benefit  of 
acting  without  knowledge. 
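The trade-off at the junction can be made concrete with a short calculation. The sketch below assumes undiscounted returns, a reward that is equally likely to be in either cell, and a “look” cost that is simply subtracted from the return; the function name and these simplifications are illustrative, not part of the original analysis.

```python
def junction_returns(g1, g2, look_cost):
    """Expected return of acting blindly versus looking first at state J."""
    direct = 0.5 * g1 + 0.5 * g2      # "up" without looking: G1 on half the trials
    informed = g1 - look_cost         # look, then always move toward G1
    return direct, informed


if __name__ == "__main__":
    for g1, g2, cost in ((10.0, 9.0, 2.0), (10.0, 0.0, 2.0)):
        direct, informed = junction_returns(g1, g2, cost)
        better = "act directly" if direct > informed else "look first"
        print(g1, g2, cost, "->", direct, informed, better)
```

Under these assumptions, acting directly is preferable whenever half the difference between G1 and G2 is smaller than the cost of looking, which includes the case described above where the full difference is smaller than that cost.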


The  Q-CUP  Algorithm 

One  approach  to  overcoming  this  limitation  is  to  abandon  the  distinction  between 
perceptual and overt actions and to take a more careful look at the failure of
Q-learning (cf. Chapter 5). Recall that in Chapter 5, it was shown that inconsistent
decisions interfere with Q-learning by having action-values that average the
expected returns of the external decisions they represent. These inaccurate estimates
(utility  aberrations),  in  addition  to  inaccurately  estimating  the  true  utility  of  per¬ 
forming  a  given  action  at  a  specific  point  in  time,  also  interfere  with  estimating 
the  action-values.  While  it  is  unlikely  that  the  utility  aberrations  for  inconsistent 
decisions can be eliminated, it may be possible to prevent them from interfering
with the action-value estimates of consistent decisions. That is, suppose we are
able to detect inconsistent decisions (say, using the techniques described in the
previous chapters). Suppose further that standard Q-learning is used to learn a
control  policy  for  both  overt  and  perceptual  actions,  except  that  instead  of  using 
a  fixed  updating  rule,  like  the  1-step  corrected  truncated  return,  the  updating 
rule  is  always  based  on  multi-step  estimators  whose  correction  term  is  the  util¬ 
ity  of  a  consistent  state.  That  is,  after  executing  a  given  decision,  rewards  are 
collected until the agent encounters a consistent state; at that time a return
estimate for updating the action-value for the earlier decision is constructed from
the accumulated, appropriately discounted rewards and the utility estimate for
the  current  consistent  state.  Under  these  circumstances,  it  should  be  possible 
to  estimate  accurately  the  sampled  average  of  the  action-values  of  the  external 
decisions represented. The action-value for an inconsistent decision depends on
the relative frequency that external decisions represented by that decision are
sampled (or encountered), and in general the inconsistent decision will not
accurately estimate the action-values of the external decisions it represents. However,
the  action-values  for  consistent  decisions  should  be  accurate  with  respect  to  the 
external  decisions  they  represent.  We  call  this  integrated  approach  to  control 
the  Q-CUP  algorithm  (for  Consistent  UPdating).  The  Q-CUP  algorithm  may  be 
able  to  outperform  the  CR-method  on  certain  decision  problems.  For  instance, 
the Q-CUP algorithm should be able to learn the optimal policy for the task in
Figure  8.1,  since  in  this  case,  the  sampled  average  of  the  return  obtained  after 
executing  either  the  “up”  or  “down”  overt  actions  from  state  J  is  greater  than 
the  expected  return  obtained  by  first  looking  at  the  arrow  to  disambiguate  the 
situation.  Also,  notice  that  if  the  difference  between  G1  and  G2  is  large  enough  to 
make  the  “look”  action  worthwhile,  the  Q-CUP  algorithm  should  learn  to  perform 
that  action.  While  the  Q-CUP  algorithm  appears  to  be  able  to  outperform  the 
CR-method  on  the  task  in  Figure  8.1,  it  may  not  always  be  appropriate,  since  the 
policy  it  learns  may  be  unstable  and  its  performance  may  be  erratic.  This  can  be 
demonstrated  by  modifying  the  task  in  Figure  8.1  as  follows: 

1. Increase the difference between G1 and G2 to some large value (e.g., G1 =
10, G2 = 0).

2. Adjust the distribution of trials so that G1 is almost always found in the
upper cell (e.g., 99% of the time).

3.  Change  the  dynamics  of  the  problem  so  that  an  incorrect  up  or  down  action 
returns  the  system  to  the  start  state  S. 

Under these circumstances, an agent using the Q-CUP algorithm will almost
always initially learn to execute the “up” action from state J since with high
probability this action will be optimal for the initial series of trials. However, at some
point the agent will encounter a trial where the reward is in the lower cell. On
this trial, the agent will get stuck in a loop, continually trying the “up” action
until the action-value for the “up” action in state J is reduced enough to permit
another action or until the agent selects an alternate action at random. This
oscillation in the agent’s action-value function will tend to repeat indefinitely and
its solution time will continue to be unnecessarily long for the uncommon case in
which the reward is in the lower cell.

On the other hand, an agent using the CR-method on this task would have
an action-value function that is more stable. While it might be slightly less
efficient in the common case (reward up) it would be significantly more effective
in  the  uncommon  case  (reward  down).  Indeed,  depending  upon  the  learning  rate 
and  exploration  strategy  used,  an  agent  using  the  CR-method  may  on  average 
outperform  an  agent  using  Q-CUP. 

The  Q-CUP  algorithm  has  not  yet  been  implemented  or  tested,  and  it  is  dif¬ 
ficult  to  speculate  on  its  specific  performance  characteristics.  However,  because 
it  does  not  distinguish  between  perceptual  and  overt  actions,  it  overcomes  one  of 
the  major  limitations  of  the  CR-method.  Whether  it  works  well  in  practice  or  has 
limitations  of  its  own  remains  to  be  seen. 
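The consistent-updating rule described above can be sketched as follows. Because Q-CUP has not been implemented, everything here is a hypothetical reading of the description: the function name, the table layout, the use of the maximum action-value as the utility of a consistent state, and the fallback to the plain accumulated return when a trial ends before any consistent state is reached are all assumptions.

```python
from collections import defaultdict


def qcup_update(q, episode, is_consistent, alpha=0.1, gamma=0.9):
    """Update action-values using returns corrected only at consistent states.

    q:        nested table, q[state][action] -> action-value
    episode:  list of (state, action, reward, next_state) transitions
    is_consistent: predicate marking states assumed to be consistent
    """
    for t, (state, action, _, _) in enumerate(episode):
        ret, discount = 0.0, 1.0
        for _, _, reward, next_state in episode[t:]:
            ret += discount * reward
            discount *= gamma
            if is_consistent(next_state):
                # Correct the truncated return with the utility estimate of
                # the first consistent state encountered.
                ret += discount * max(q[next_state].values(), default=0.0)
                break
        q[state][action] += alpha * (ret - q[state][action])


if __name__ == "__main__":
    q = defaultdict(lambda: defaultdict(float))
    trial = [("S", "right", 0.0, "J"), ("J", "up", 1.0, "T")]
    qcup_update(q, trial, is_consistent=lambda s: s != "J", alpha=0.5)
    print(dict(q["S"]), dict(q["J"]))
```

In this toy trial the update for the decision taken in S skips over the inconsistent state J and is corrected using the terminal state T, so the averaged action-values stored at J never enter the estimate for S.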

Neural  Network  Algorithms 

Other  alternatives  to  the  CR-method  that  allow  for  the  free  intermixing  of  overt 
and  perceptual  actions  are  some  of  the  recent  neural  network  algorithms  [Jordan 
and  Rumelhart,  1990;  Schmidhuber,  1990b;  Thrun  and  Moller,  1991;  Nguyen  and 
Widrow,  1989].  In  these  algorithms,  no  explicit  estimate  of  the  future  expected 
return  is  maintained  or  used  to  update  the  controller’s  policy.  Instead,  adapta¬ 
tion  of  the  mapping  between  sensory  inputs  and  control  actions  is  achieved  by 
modifying  the  weights  of  the  network  using  a  gradient  descent  procedure.  In  par¬ 
ticular,  at  each  point  in  time,  the  gradient  of  the  immediately  received  reward 
is  computed  with  respect  to  the  weights  in  the  network  (using  backpropagation). 
These  immediate  gradients,  accumulated  over  time,  are  then  combined  to  yield 
an  approximation  of  the  gradient  for  the  total  discounted  return,  which  in  turn 
is  used  to  adjust  the  network’s  performance.  Since  these  methods  do  not  depend 
upon  explicit  utility  estimates,  as  in  Q-learning,  they  do  not  suffer  as  much  from 
utility  aberrations.  However,  they  do  have  a  number  of  drawbacks  of  their  own. 
First,  since  the  network  is  adjusted  based  on  actual  rewards  received  (and  has 
no  notion  of  an  underlying  Markov  decision  process),  these  algorithms  may  suffer 
from  the  same  instabilities  expected  to  plague  the  Q-CUP  algorithm.  Second,  in 
order  to  perform  credit  assignment  properly  for  temporally  delayed  rewards,  it  is 
necessary  to  perform  backpropagation  steps  that  go  back  in  time.  This  process 
requires  the  system  to  keep  a  history  of  neuron  activations  and  weight  values,  and 
requires a differentiable model of the domain’s forward dynamics (used during
backpropagation). Finally, since these algorithms are based on backpropagation
(an approximate gradient descent method), they may get stuck in local minima
and fail to converge on an optimal policy. Jordan [Jordan and Rumelhart, 1990]
and Schmidhuber [Schmidhuber, 1990b] have done the most work on these
algorithms. Schmidhuber, in particular, has described an algorithm for recurrent
networks that is capable of learning decision tasks that are non-Markov. However,
the results reported so far are preliminary and involve very simple problems (quite
a bit easier than the GB-task). In their current state, these algorithms do not
appear to be scalable to more complex tasks. Nevertheless, additional research
may yield more powerful algorithms based on these direct gradient methods.

Figure 8.2: A navigation task that requires memory. In this task the position of
the reward is indicated by the arrow in a signal tile. We assume the agent can
only sense features of the cell it currently occupies. To solve this task we would
like the agent to first determine and remember the direction of the arrow and then
navigate to the appropriate location. The systems described in this dissertation
have no means of remembering and using any information other than that which
is immediately perceivable.

8.1.2  Restrictions  on  the  Task  Domain 

A  second  limitation  of  the  algorithms  described  so  far  is  that  they  are  applicable 
only  to  tasks  in  which  every  external  state  can  be  identified  using  perceptual 
actions  only.  This  restriction  excludes  tasks  in  which  the  agent  must  perform 
overt  actions  to  gain  needed  state  information.  For  example,  in  the  block  stacking 
domain, tasks that require the agent to move blocks in order to identify the
external state (e.g., unstack a block in order to see what is behind it) are not
permissible.  Navigation  tasks  where  the  robot  cannot  discern  the  external  state 
from  immediately  visible  cues  are  also  beyond  the  scope  of  the  present  algorithms. 
An  example  of  this  type  is  shown  in  Figure  8.2.  In  this  task,  the  robot’s  objective 
is  to  navigate  to  a  cell  that  contains  food  (a  reward).  As  in  the  task  depicted  in 
Figure  8.1,  the  position  of  the  reward  varies  from  trial  to  trial  and  is  indicated 
with  an  arrow  that  is  located  in  one  of  the  cells.  If  the  robot’s  internal  state 
at  each  point  in  time  is  only  determined  by  attributes  of  the  cell  it  immediately 
occupies,  then  a  robot  will  be  unable  to  learn  a  reliable  control  strategy  since  it 
will  be  unable  to  remember  the  direction  of  the  arrow  in  the  signal  cell.  Ideally,  the 
agent  would  execute  overt  actions  that  navigate  to  the  signal,  sense  and  remember 
the  value  of  the  arrow,  navigate  to  the  junction,  recall  the  information  about 
the  arrow,  and  select  the  next  appropriate  move.  Unfortunately,  the  systems 
described  in  this  dissertation  support  neither  the  use  of  overt  actions  for  state 
identification  (navigating  to  the  signal  cell)  nor  any  mechanism  for  recording  and 
recalling  relevant  aspects  of  previously  visited  positions.  Both  of  these  processes 
are  necessary  to  solve  the  task  in  Figure  8.2. 

Even  though  the  Q-CUP  algorithm  allows  for  the  execution  of  overt  actions 
in  inconsistent  states,  the  Q-CUP  algorithm  will  also  fail  to  solve  this  task  with¬ 
out  any  means  for  explicitly  remembering  the  value  of  the  signal  arrow  once  the 
robot  leaves  the  signal  cell.  A  simplifying  assumption  implicit  in  all  of  the  algo¬ 
rithms  described  so  far  is  that  the  internal  state  space  is  defined  solely  in  terms 
of the agent’s immediate percepts. The agent is totally situated in that after
each  overt  action  its  internal  state  is  essentially  flushed  and  reacquired  based 
only  on  immediate  sensory  inputs.  Clearly,  this  is  a  very  restrictive  assumption. 
Unfortunately,  very  little  is  currently  known  about  how  to  extend  these  state 
space  models  efficiently  to  include  features  of  past  situations.  This  limitation 
represents  an  exciting  challenge  to  this  approach  to  adaptive  perception  and  ac¬ 
tion.  Recent  work  on  learning  finite  state  automata  by  exploration  may  provide 
a  reasonable  starting  point  for  attacking  this  problem  [Rivest  and  Schapire,  1987; 
Mozer  and  Bachrach,  1989].  Approaches  that  distinguish  internal  states  based  on 
differences in the agent’s transition history [McCallum, 1991] and approaches that
combine  the  CR-method  with  buffered  sensory  inputs  or  deictic  memory  modules 
may  also  yield  useful  algorithms. 
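
The role of memory in this task can be illustrated with a small sketch. The code below is only an illustration under stated assumptions: the cell names, the two hypothetical worlds, and the functions `observe`, `situated_state`, and `state_with_memory` are invented here and are not part of the systems described in this dissertation. It shows that an internal state built from the immediate percept alone is identical at the junction in both worlds, whereas an internal state that also carries the remembered arrow value separates them.

```python
# Hypothetical miniature of a signal-then-junction task (all names are illustrative).
# The two worlds differ only in the arrow shown in the signal cell.
WORLDS = {
    "food_left": {"signal": "arrow_left", "corridor": "blank", "junction": "blank"},
    "food_right": {"signal": "arrow_right", "corridor": "blank", "junction": "blank"},
}
PATH = ["signal", "corridor", "junction"]  # cells visited, in order


def observe(world, cell):
    """Immediate percept: attributes of the occupied cell only."""
    return WORLDS[world][cell]


def situated_state(world, path):
    """Internal state of a purely situated agent: the latest percept, nothing else."""
    return observe(world, path[-1])


def state_with_memory(world, path):
    """Internal state augmented with the most recent arrow seen along the path."""
    remembered = None
    for cell in path:
        percept = observe(world, cell)
        if percept.startswith("arrow"):
            remembered = percept
    return (observe(world, path[-1]), remembered)


def is_aliased(state_fn):
    """True if both worlds produce the same internal state at the junction."""
    states = {state_fn(world, PATH) for world in WORLDS}
    return len(states) == 1


print(is_aliased(situated_state))     # True: the junction is perceptually aliased
print(is_aliased(state_with_memory))  # False: the remembered arrow disambiguates it
```

The sketch sidesteps the hard part, which is learning what to remember; it only shows why some form of retained state is required.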

Reinforcement  learning  algorithms  based  on  recurrent  neural  networks  offer 
another  possible  approach  to  this  problem  since  the  output  of  a  recurrent  net¬ 
work  at  a  given  point  in  time  can  depend  upon  the  total  sequence  of  inputs  it 
has  received  up  to  that  time  [Schmidhuber,  1990c;  Williams  and  Zipser,  1988; 
Simard,  1991;  Pineda,  1987].  In  principle,  these  networks  can  learn  to  encode  and 
remember  relevant  past  information  in  their  hidden  units  as  needed  to  learn  an 
optimal  control  policy.  As  of  this  time,  I  have  not  experimented  with  adaptive 
controllers based on recurrent neural networks. However, in light of their tendency
to get stuck in local minima, it is doubtful that existing algorithms can
be  used  to  solve  the  task  in  Figure  8.2.  Nevertheless,  continued  experimentation 
with  recurrent  neural  networks  for  adaptive  perception  and  action  is  certainly 
warranted. 


8.2  State  Space  Size 

Another  issue  that  requires  further  examination  is  the  rate  at  which  the  size  of 
consistent state spaces grows as the complexity of a decision task is increased. For
example,  when  a  task  can  be  accomplished  in  several  different  ways  (any  of  which 
might  be  optimal),  the  size  of  the  minimal  consistent  representation  for  that  task 
tends  to  grow  as  the  product  of  the  sizes  of  the  state  spaces  needed  to  represent 
each  of  the  individual  approaches  to  the  task.  An  example  from  the  block  stacking 
domain  illustrates  the  problem  more  clearly. 

Recall that for the GB-task, each pile contains exactly one green block. This
restriction  was  imposed  to  ensure  that  the  robot’s  two  marker  sensory-motor  sys¬ 
tem  is  adequate  to  register  all  the  information  necessary  to  represent  the  task 
consistently.  If  two  or  more  green  blocks  are  allowed  in  the  pile,  then  a  consistent 
representation  (with  respect  to  picking  up  a  green  block)  must  encode  informa¬ 
tion  about  the  amount  of  work  needed  to  clear  and  pick  up  each  individual  green 
block.  That  is,  to  solve  the  task  optimally,  the  robot  must  analyze  each  green 
block  to  determine  which  one  is  easiest  to  pick  up.  For  the  GB-task  (with  one 
green  block)  the  attention  frame  marker  is  used  to  register  the  number  of  blocks 
above  the  green  block,  and  each  state  in  the  consistent  internal  representation 
is associated with a unique stack height. For a GB-task with multiple green blocks, a
marker  is  needed  to  register  the  stack  height  of  each  green  block  in  the  pile,  and 
each  consistent  internal  state  represents  a  unique  combination  of  stack  heights 
—  resulting  in  a  state  space  that  is  roughly  an  n-fold  product  of  the  state  space 
for  the  single  block  GB-task  (where  n  is  the  number  of  green  blocks  in  the  pile). 
This  stack  height  information  is  needed  to  consistently  represent  the  state  of  the 
external  world.  Failing  to  mark  and  register  even  one  green  block  opens  the  door 
for inconsistencies since that unmarked green block may be the easiest one to pick
up.¹

¹Strictly speaking, the minimal consistent internal representation need not encode the stack
heights of each and every green block. Actually, only knowledge of the easiest green block, its
stack height, and some hand information is needed to define the consistent states. However,
in order to determine the easiest block to pursue, the agent must compare the stack heights of
each of the green blocks, and given the organization of the sensory-motor system this comparison
requires marking and registering the stack heights of each block.

Similar, but even more dramatic growth in the state space size occurs for more
complex tasks. For instance, suppose the robot receives a reward for assembling
configurations of several blocks. Take the stack red-on-green-on-blue, for example.
In this case, to perform optimally the robot must consider all possible combinations
of red, green, and blue blocks that could be used to construct the desired
stack. For optimal control, it is not sufficient to find just any red, green, and blue
blocks; all combinations must be considered and information about all red, green,
and blue blocks must be registered.

Large  state  spaces  also  result  when  there  are  multiple  sources  of  reward.  A 
robot  that  is  rewarded  for  picking  up  a  green  block  or  for  assembling  the  stack  red 
on  blue  must  consider  which  one  of  the  subtasks  to  pursue.  A  rat  that  receives 
rewards  for  satisfying  its  diverse  bodily  needs  (feeding,  drinking,  resting,  mating, 
etc.)  must  in  general  consider  the  appropriateness  of  pursuing  each  activity  in 
light  of  the  current  situation,  and  consider  the  effect  of  pursuing  one  activity  on 
its  ability  to  satisfy  other  needs  later. 

In  all  of  these  examples,  large  state  spaces  are  needed  to  guarantee  that  the 
agent has sufficient information to perform these multifaceted tasks optimally.
Unfortunately, these large state spaces 1) put an unrealistic burden on the sensory
system (to track and monitor continually every potentially relevant object)
and  2)  lead  to  slower  learning  of  each  individual  subtask  by  inhibiting  passive 
abstraction/generalization  with  an  overly  large  state  space. 

One  approach  to  reducing  this  scaling  problem  is  to  introduce  additional  struc¬ 
ture  into  the  decision  system  that  partitions  the  overall  decision  problem  into  its 
constituent  pieces.  That  is,  instead  of  using  a  single  monolithic  internal  represen¬ 
tation  and  a  single  monolithic  policy  function,  the  decision  system  is  composed  of 
a  set  of  modules,  each  of  which  is  responsible  for  learning  to  represent  and  perform 
a  single  subtask  of  the  overall  decision  problem.  We  call  one  of  these  modules  a 
schema.  In  the  case  of  a  block  stacking  task  where  multiple  configurations  yield  a 
reward  (e.g.,  holding  a  green  block  or  building  a  red-on-blue  stack),  each  schema 
would  learn  to  represent  and  perform  consistently  one  of  the  subtasks. 

Roughly,  each  schema  would  correspond  to  a  complete  decision  system  in  itself. 
It  would  consist  of  an  identification  routine  and  an  overt  control  component;  it 
would  generate  and  maintain  its  own  internal  representation;  and  it  would  use  the 
same  algorithms  described  previously  to  adapt  itself.  The  key  difference  is  that 
each  schema’s  adaptation  would  be  based  on  a  single  (restricted)  type  of  reward 
(i.e.,  the  reward  associated  with  the  subtask  it  was  performing).  This  reward 
type  could  be  identified  either  a  priori  (say  through  an  innate  set  of  different 
reward types such as hunger-reward, thirst-reward, etc.) or by a set of specific
learned conditions that classify the reward (e.g., reward consistently associated
with holding a green block). In any case, the internal representation learned would
be aimed at consistency with respect to a specific narrow class of reward, and the
control  policy  would  aim  to  maximize  the  accumulation  of  the  specific  reward 
only. 


Another important aspect of schemas would be the explicit incorporation of
variables (or entities) which could be bound to specific objects in the external
world and used to instantiate different instances of a schema. For instance, a
single schema could be learned to control the process of picking up a single green
block. This schema might have a single variable, which would be bound to the
specific  green  block  to  be  picked  up.  At  any  point  in  time,  the  utility  values  of 
the  internal  state  of  this  schema  would  represent  the  utility  of  the  external  world 
with  respect  to  picking  up  the  green  block  bound  to  the  schema.  Given  a  single 
schema  of  this  type,  the  agent  could  then  perform  a  series  of  “bind-and-evaluate” 
operations  to  search  for  the  best  green  block  to  pursue  when  more  than  one  was 
present. 

In  general,  given  a  set  of  schemas  and  their  instantiations,  overall  control  could 
be  achieved  by  first  evaluating  the  state  of  the  external  world  with  respect  to  each 
instantiation  and  then  choosing  to  follow  the  actions  dictated  by  the  schema  with 
the largest utility.
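
The bind-and-evaluate search and the greedy arbitration rule can be sketched as follows. This is a minimal illustration, not an implementation from this work: the schema architecture was never built, so the `Schema` class, `select_schema`, the `discounted` utility, and all numbers are hypothetical. Each schema scores the world with respect to one bound object, and control goes to the instantiation with the largest utility.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Schema:
    """One module: evaluates the world with respect to a single bound object."""
    name: str
    utility: Callable[[dict, str], float]  # hypothetical learned utility estimate

    def bind_and_evaluate(self, world: dict, candidates: Iterable[str]) -> Tuple[float, str]:
        # Bind the schema variable to each candidate in turn; keep the best binding.
        return max((self.utility(world, obj), obj) for obj in candidates)


def select_schema(
    schemas: List[Tuple[Schema, List[str]]], world: dict
) -> Optional[Tuple[float, str, str]]:
    """Greedy arbitration: (utility, schema name, bound object) with the largest utility."""
    best = None
    for schema, candidates in schemas:
        if not candidates:
            continue  # nothing in the world to bind this schema to
        value, obj = schema.bind_and_evaluate(world, candidates)
        if best is None or value > best[0]:
            best = (value, schema.name, obj)
    return best


def discounted(world: dict, obj: str) -> float:
    # Made-up utility: a unit reward discounted by the number of steps still needed.
    return 0.9 ** world["steps_needed"][obj]


world = {"steps_needed": {"green1": 4, "green2": 2, "red1": 5}}
schemas = [
    (Schema("pick-up-green", discounted), ["green1", "green2"]),
    (Schema("stack-red-on-blue", discounted), ["red1"]),
]
choice = select_schema(schemas, world)
print(choice[1], choice[2])  # pick-up-green green2
```

In this toy setting the green block needing the fewest steps wins the comparison. A real schema would have to learn its utility estimates rather than read them from a table.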

The major disadvantage of this modular approach is that it may lead to non-optimal
control. That is, by greedily performing the schema with the largest
utility,  the  agent  may  fail  to  maximize  its  long-term  reward  since  it  may  be  possible 
that  execution  of  another  schema,  which  itself  yields  a  lower  immediate  return,  sets 
up  a  third  schema  that  yields  a  larger  reward.  If  the  cumulative  return  of  these  two 
schemas  is  greater  than  the  return  achievable  by  the  single  maximal  schema,  then 
non-optimal  performance  results.  For  these  cases,  more  global  algorithms  (like  a 
monolithic  decision  system)  that  consider  the  utilities  of  performing  combinations 
of tasks would perform better. However, in many cases, a greedy algorithm may
suffice. 

One  of  the  major  advantages  of  the  proposed  modular  approach  is  that  it 
reduces  the  overall  size  of  the  internal  state  space  and  policy  function.  For  a 
monolithic  decision  system  the  state  space  (and  the  domain  of  the  decision  policy) 
has a size that scales as the product of the sizes of the individual subtask state spaces,
whereas the modular approach has a total space requirement that scales linearly
in  the  number  of  subtasks  (i.e.,  the  total  size  is  the  sum  of  the  individual  state 
spaces). 
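
A short calculation makes the difference concrete. The subtask sizes below are hypothetical and chosen only for illustration; the function names are likewise assumptions.

```python
from math import prod


def monolithic_size(subtask_sizes):
    """Joint representation: one state per combination of subtask states."""
    return prod(subtask_sizes)


def modular_size(subtask_sizes):
    """Schema-based representation: each schema keeps only its own states."""
    return sum(subtask_sizes)


# Hypothetical example: three subtasks needing 20 internal states each.
sizes = [20, 20, 20]
print(monolithic_size(sizes))  # 8000
print(modular_size(sizes))     # 60
```

Adding a fourth subtask of the same size multiplies the first figure by 20 but only adds 20 to the second.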

Also,  the  total  sensing  requirements  of  the  agent  can  be  reduced  by  committing 
to the complete execution of a given schema. That is, at the beginning of a task,
the agent could evaluate the utility of pursuing each schema and select the best one
for execution. If the agent then commits to performing that activity, it no longer
needs to perform subsequent sensing operations to evaluate the other schemas. In
general,  any  number  of  high  level  control  strategies  can  be  used  to  trade  off  sensing 
and  overt  performance. 

A  third  advantage  of  this  modular  approach  is  that  the  time  needed  to  learn 
individual schemas should be substantially reduced compared to the learning time
for a monolithic decision system. This follows since the individual internal state
spaces  needed  to  represent  each  subtask  are  significantly  smaller  than  the  state 
space  needed  to  represent  consistently  the  conjoined  monolithic  task. 

To  date,  we  have  done  very  little  work  on  developing  this  schema-based  control 
architecture  other  than  to  recognize  the  need  for  it  and  speculate  on  how  schemas 
could be learned and controlled. Although many issues remain unresolved, we are
confident  that  a  modular  approach  of  this  type  will  be  useful  for  managing  the 
state  space  size  and  the  learning  rate  as  task  complexity  is  increased. 


8.3  Other  Areas  for  Future  Research 

There  are  a  number  of  other  topics  for  future  research  that  lie  on  the  critical  path 
between  the  ideas  proposed  in  this  dissertation  and  their  useful  application  to  real 
systems. 

The  study  of  active  perception  is  in  its  infancy  and  is  an  area  that  needs  fur¬ 
ther  development.  The  simulated  sensory-motor  system  used  in  Meliora  is  overly 
simplistic,  unrealistic,  and  specifically  designed  to  match  the  needs  of  the  GB- 
task.  The  deictic  sensory-motor  systems  used  by  Agre  [Agre,  1988]  and  Chapman 
[Chapman,  1990b],  though  more  sophisticated  and  more  general  purpose,  are  also 
simulated.  These  principles  of  active  perception  need  to  be  validated  on  real 
physical  systems.  Experimentation  on  real  systems,  in  addition  to  validating  the 
feasibility  of  active  perception  and  deictic  representations,  is  also  likely  to  sug¬ 
gest  new  visual  analysis  operations  and  indexing  strategies  that  might  have  been 
overlooked  otherwise  [Swain,  1990;  Wixson,  1991]. 

It  is  also  necessary  to  develop  more  efficient  implementations  for  the  CR- 
method.  Meliora  used  tables  to  implement  the  various  algorithms  described  above. 
While  algorithms  based  on  tables  are  convenient  for  exposition  purposes  and  easy 
to  implement,  they  require  too  much  space  for  larger  scale  tasks.  Implemen¬ 
tations  based  on  neural  networks  or  other  more  concise  function  approximation 
techniques need to be explored. Also, Meliora’s internal state space was defined
precisely  by  the  input  vector  generated  by  the  sensory-motor  system.  When  this 
vector  contains  redundant  or  useless  information,  it  introduces  unnecessary  dis¬ 
tinctions  between  states  and  increases  the  size  of  the  internal  representation.  An 
approach  that  combines  the  perceptual  control  process  used  in  Meliora  and  the 
input  generalization  techniques  described  by  Chapman  and  Kaelbling  [Chapman 
and  Kaelbling,  1991]  or  Tan  [Tan,  1991b]  could  lead  to  even  more  compact  and 
task-specific  internal  representations. 

Also,  if  reasonable  learning  rates  are  to  be  achieved,  the  cooperative  mecha¬ 
nisms  described  in  Chapter  7  (or  other  algorithms  like  them)  must  be  developed 
further. Techniques for recognizing and interpreting the behavior of others are central
to this endeavor. More powerful communication and signaling protocols can
also be expected to facilitate learning.

Finally,  perhaps  the  most  important  work  that  can  be  done  to  validate  and 
advance  the  ideas  presented  here  is  to  apply  them  to  the  control  of  a  real  physical 
system  —  perhaps  a  real  block-stacking  robot,  or  a  robot  that  learns  other  chil¬ 
dren’s  games  (e.g.,  piece-in-hole  games).  Building  such  a  system  would  no  doubt 
uncover  any  number  of  assumptions  and  difficulties  that  have  been  unknowingly 
abstracted  out  of  the  formal  model  and  our  simulated  environment.  Just  a  few 
issues  that  are  likely  to  arise  include:  the  weaknesses  of  discrete  state-time  con¬ 
trol  models,  the  need  for  memory  and  state,  the  need  for  hierarchical  structures 
to  organize  both  the  fine  grained  and  large-scale  structure  of  realistic  tasks,  and 
the  need  to  perform  certain  actions  simultaneously. 


9  Conclusions 


Control  is  the  process  of  directing  interaction  with  an  external  environment  in 
order  to  bring  about  some  desirable  outcome.  This  process  involves  both  sensing 
and  action  —  sensing  to  gain  information  and  action  to  effect  change.  In  both 
classical  control  theory  and  AI  it  is  common  to  begin  by  assuming  that  the  control 
system  has  a  fixed  set  of  sensors  that  provide  it  with  all  the  relevant  information 
needed  for  decision  making.  In  AI,  it  is  common  to  assume  the  existence  of  an 
objective  internal  representation  that  uniquely  labels  and  describes  properties  of 
all  of  the  potentially  relevant  objects  in  a  task  domain.  While  this  assumption  may 
be  reasonable  for  tasks  that  are  narrow  in  scope  or  well  understood  in  advance, 
it  is  unrealistic  for  more  complex  tasks  that  have  diverse  sensory  requirements  or 
that  are  poorly  understood  ahead  of  time.  Under  these  circumstances  systems 
that  have  flexible  but  limited  access  to  the  environment  are  more  appropriate. 
The  Visual  Routines  model  of  human  spatial  vision  [Ullman,  1984],  subsequent 
work  on  deictic  representations  [Agre  and  Chapman,  1987;  Agre,  1988;  Chapman, 
1990b],  and  other  recent  work  [Ballard,  1991;  Swain,  1990;  Rimey  and  Brown, 
1990;  Wixson,  1990;  Wixson  and  Ballard,  1991]  demonstrate  the  potential  utility 
of  active  perception.  This  dissertation  extends  this  line  of  research  by  considering 
the  application  of  reinforcement  learning  to  the  adaptive  control  of  perception  and 
action.  Our  aim  has  been  to  develop  algorithms  by  which  an  autonomous  agent 
can  learn  not  only  the  overt  actions  needed  to  perform  a  task,  but  also  perceptual 
control  strategies  for  generating  an  adequate,  yet  efficient,  task-specific  internal 
representation. 

To  study  this  problem  a  series  of  programs,  collectively  called  Meliora,  were 
developed  in  an  attempt  to  build  a  system  that  could,  using  active  perception 
and a deictic representation, learn to solve simple block manipulation tasks. Using
the block-stacking task as a guide, a formal model of the control problem
facing  embedded  decision  systems  was  described.  This  model  formalizes  the  effect 
of  an  agent’s  sensory-motor  system  in  establishing  the  relationship  between  an 
abstract  Markov  model  of  a  task  and  the  decision  problem  seen  by  the  embedded 
controller. The model is used to show that intelligent systems invariably face decision
problems that are highly non-Markov and that cannot typically be learned
using  standard  reinforcement  learning  methods.  It  was  shown  that  improper  con¬ 
trol  of  the  sensory  system  leads  to  perceptual  aliasing  —  states  in  the  internal 
representation  that  confound  functionally  disparate  states  in  the  Markov  model  of 
the  external  task.  Perceptual  aliasing  was  shown  to  interfere  with  reinforcement 
learning  by  prohibiting,  at  certain  points  in  time,  the  accurate  estimation  of  the 
utility  of  performing  an  action. 

A  new  decision  procedure,  called  the  Lion  algorithm,  was  developed  to  over¬ 
come  these  difficulties  and  was  demonstrated  on  the  GB-task.  The  Lion  algorithm 
was  shown  to  be  a  specific  instance  of  a  more  general  technique,  called  the  Con¬ 
sistent  Representation  (CR)  method.  The  principal  idea  of  the  CR-method  is  to 
separate  control  into  an  identification  stage  followed  by  an  overt  control  stage. 
During  identification  the  agent  performs  perceptual  actions  to  collect  informa¬ 
tion  needed  to  generate  a  consistent  internal  state.  During  overt  control,  this 
consistent  state  is  used  to  guide  selection  of  the  next  overt  action.  Both  the  iden¬ 
tification  and  overt  control  stages  are  adaptive.  In  Meliora,  adaptive  overt  control 
was  achieved  using  variations  on  1-step  Q-learning.  Adaptation  of  the  identifica¬ 
tion  procedure  is  based  on  detecting  and  eliminating  inconsistent  internal  states 
from  the  internal  representation.  Several  methods  for  detecting  inconsistent  in¬ 
ternal  states  were  proposed  and  demonstrated  in  Meliora,  and  related  work  by 
Chapman and Kaelbling [Chapman and Kaelbling, 1991] and Tan [Tan, 1991a]
was  described. 

Next,  the  effect  of  unbiased  search  on  the  learning  rate  was  addressed.  It  was 
shown  that  when  an  agent  has  little  or  no  prior  knowledge  of  a  task  and  when 
reward  is  sparse  or  delayed,  lack  of  initial  guidance  combined  with  lack  of  feedback 
can  lead  to  unstructured  searches  and  excessive  learning  times.  Two  cooperative 
mechanisms, Learning-with-an-External-Critic (LEC) and Learning-By-Watching
(LBW),  that  reduce  search  were  described  and  analyzed.  Formal  analysis  showed 
that  for  a  restricted  class  of  decision  problems,  these  algorithms  have  expected 
learning  times  that  are  dependent  on  the  length  of  the  optimal  solution  path  and 
independent of the state space size. Experimental results on the GB-task confirmed
this analysis and showed that LEC and LBW significantly improve the performance
of  Meliora.  Part  of  the  performance  improvement  was  due  to  a  reduction  in 
the  latency  between  overt  action  and  feedback;  however,  the  additional  feedback 
available  through  these  mechanisms  was  also  used  to  detect  inconsistent  internal 
states  more  quickly,  facilitating  identification. 

The  work  described  in  this  dissertation  only  partially  achieves  our  objective 
to  develop  algorithms  for  adaptive  perception  and  action.  The  decision  to  split 
perceptual  and  overt  control  into  separate  stages  precludes  the  application  of  the 
CR-method  to  tasks  that  require  overt  actions  for  identification.  Moreover,  even 
when the CR-method is applicable, there exist tasks that it cannot solve optimally.
Nevertheless, the CR-method represents initial progress towards a theory
of adaptive control that incorporates active perception. Furthermore, it appears
that  the  notion  of  consistency  can  be  used  to  extend  the  CR-method  and  derive 
alternative  methods  that  address  some  of  the  current  limitations.  The  Q-CUP 
algorithm  and  the  schema-based  approach  to  task  decomposition  are  examples  of 
two such directions for future research.


A Proofs for Search Analysis Theorems


Following  are  the  proofs  for  the  theorems  given  in  Chapter  7. 

Theorem 3: In a homogeneous state space, the expected time needed by a zero-initialized
Q-learning system to learn the actions along an optimal solution path
for a problem solving task is bounded below by the expression

$$\frac{1}{1-P_{=}}\left\{\frac{P_{-}}{(1-2P_{+})^{2}}\left(\left[\frac{P_{+}}{P_{-}}\right]^{k+1}-\left[\frac{P_{+}}{P_{-}}\right]^{k-i+1}\right)+\frac{i}{1-2P_{+}}\right\} \tag{A.1}$$

where

$$P_{=} = \frac{b_{=}}{b_{=} + b_{+} + b_{-}},$$

$$P_{+} = \frac{b_{+}}{b_{+} + b_{-}},$$

$$P_{-} = 1 - P_{+}, \tag{A.6}$$

and where $i$ is the length of the optimal solution, and $k$ is the depth bound on the
state space (with respect to the goal).

Proof: 

There  are  two  keys  to  this  proof.  The  first  is  to  recognize  that  on  its  initial  trial 
a  zero-initialized  Q-learning  system  performs  a  pure  random  walk  that  begins  in 
the start state and ends in the goal state. The second is to recognize that the
expected length of a random walk on a k-bounded homogeneous state space is the
same as the expected length of the random walk on an equivalent 1-dimensional
state space. That is, because all states (except the boundaries) have the same connectivity
pattern, states whose distance to the goal is the same can be collapsed


into  a  single  node.  Similarly,  all  boundary  states  can  be  collapsed  into  the  same 
node.  Thus,  results  for  the  expected  search  time  required  by  a  zero-initialized  Q- 
learning  system  can  be  obtained  by  analyzing  random  walks  on  a  1-dimensional 
state  space. 

Let’s begin with some notation. Let $b_{-}$, $b_{+}$, and $b_{=}$ be the number of actions
that, in any given state, cause the system to decrease, increase, and leave the
distance to the goal unchanged, respectively. Define $P_{+}$ and $P_{-}$ as the conditional
probability that the system, when choosing randomly, takes an action that increases
or decreases the distance to the goal, respectively, given that a distance
changing action is chosen. Also, define $P_{=}$ as the unconditional probability that
the system chooses an action that leaves the distance unchanged. Then,

$$P_{+} = \frac{b_{+}}{b_{+} + b_{-}}, \tag{A.7}$$

$$P_{-} = \frac{b_{-}}{b_{+} + b_{-}}, \tag{A.8}$$

$$P_{=} = \frac{b_{=}}{b_{-} + b_{+} + b_{=}}. \tag{A.9}$$

Finally, let $E^{k}_{i,j}$ denote the expected length of a random walk on a bounded, $(k+1)$-length,
1-dimensional state space that begins in state $i$ and ends when the system
first encounters state $j$. Also, assume the states in the state space are numbered
consecutively, $0, 1, 2, \ldots, k-1, k$.

A closed form expression for $E^{k}_{i,0}$ can be obtained by expressing $E^{k}_{k,0}$ recursively
and solving the recurrence. To do so, we momentarily assume $P_{=} = 0$ and make
the following observations. First, notice that for $0 \le i < j \le n$,

$$E^{n}_{j,i} = E^{n-i}_{j-i,0}. \tag{A.10}$$

This follows since the regions of the state spaces involved in these two random
walks are homomorphic. Also notice that in general for $i < k < j$

$$E^{n}_{j,i} = E^{n}_{j,k} + E^{n}_{k,i}. \tag{A.11}$$

Finally, note that in general $E^{n}_{i,i} = 0$.

Using Equation A.10 and Equation A.11 we have, for $k > 1$

$$E^{k}_{k,0} = E^{k}_{k,1} + E^{k}_{1,0} \tag{A.12}$$

$$\phantom{E^{k}_{k,0}} = E^{k-1}_{k-1,0} + E^{k}_{1,0}. \tag{A.13}$$

Next, an expression for $E^{k}_{1,0}$ in terms of expectations for a $k-1$ state space can
be obtained by expanding the expectation. For $k > 1$,

$$E^{k}_{1,0} = P_{-}(1) + P_{+}\left(1 + E^{k}_{2,0}\right). \tag{A.14}$$

But by Equations A.10 and A.11,

$$E^{k}_{2,0} = E^{k}_{2,1} + E^{k}_{1,0} \tag{A.15}$$

$$\phantom{E^{k}_{2,0}} = E^{k-1}_{1,0} + E^{k}_{1,0}. \tag{A.16}$$

Thus, Equation A.14 can be rewritten as

$$E^{k}_{1,0} = P_{-} + P_{+}\left(1 + E^{k-1}_{1,0} + E^{k}_{1,0}\right), \tag{A.17}$$

or after separating terms,

$$E^{k}_{1,0} = \frac{1 + P_{+}E^{k-1}_{1,0}}{1 - P_{+}}. \tag{A.18}$$

Returning to Equation A.13, we have for $k > 1$

$$E^{k}_{k,0} = E^{k-1}_{k-1,0} + \frac{1 + P_{+}E^{k-1}_{1,0}}{1 - P_{+}}. \tag{A.19}$$


Next, notice that by Equation A.11, for $k > 1$

$$E^{k-1}_{k-1,0} = E^{k-1}_{k-1,1} + E^{k-1}_{1,0}. \tag{A.20}$$

Which by Equation A.10 leads to

$$E^{k-1}_{k-1,0} = E^{k-2}_{k-2,0} + E^{k-1}_{1,0}, \tag{A.21}$$

or solving for $E^{k-1}_{1,0}$,

$$E^{k-1}_{1,0} = E^{k-1}_{k-1,0} - E^{k-2}_{k-2,0}. \tag{A.22}$$

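Substituting Equation A.22 into Equation A.19 yields a recursive definition for
$E^{k}_{k,0}$:

$$E^{k}_{k,0} = E^{k-1}_{k-1,0} + \frac{1 + P_{+}\left(E^{k-1}_{k-1,0} - E^{k-2}_{k-2,0}\right)}{1 - P_{+}}. \tag{A.23}$$

Factoring terms yields

$$E^{k}_{k,0} = \frac{1}{1-P_{+}}E^{k-1}_{k-1,0} - \frac{P_{+}}{1-P_{+}}E^{k-2}_{k-2,0} + \frac{1}{1-P_{+}}. \tag{A.24}$$

Initial conditions for the recurrence can be obtained by noticing that

$$E^{0}_{0,0} = 0 \tag{A.25}$$

and by expanding the expectation

$$E^{1}_{1,0} = \left(1 + E^{1}_{1,0}\right)P_{+} + (1)P_{-}, \tag{A.26}$$

which when solving for $E^{1}_{1,0}$ yields

$$E^{1}_{1,0} = \frac{1}{1-P_{+}}. \tag{A.27}$$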

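To simplify notation, let $a_{k} = E^{k}_{k,0}$, $c_{1} = \frac{1}{1-P_{+}}$, and $c_{0} = -\frac{P_{+}}{1-P_{+}}$. Then for $k \ge 0$,
Equation A.24 can be rewritten as

$$a_{k+2} = c_{1}a_{k+1} + c_{0}a_{k} + c_{1}, \tag{A.28}$$

where $a_{0} = 0$ and $a_{1} = \frac{1}{1-P_{+}}$. Equation A.28 is a non-homogeneous linear recurrence
equation that can be solved using generating functions [Roberts, 1984]. In
particular, solving the recurrence [Whitehead, 1991] yields

$$E^{k}_{k,0} = \frac{P_{-}}{(1-2P_{+})^{2}}\left[\frac{P_{+}}{P_{-}}\right]^{k+1} + \frac{k}{1-2P_{+}} - \frac{P_{+}}{(1-2P_{+})^{2}}. \tag{A.29}$$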


A general expression for $E^{k}_{i,0}$ can be obtained from Equation A.29 by noting
that

$$E^{k}_{i,0} = E^{k}_{k,0} - E^{k}_{k,i} = E^{k}_{k,0} - E^{k-i}_{k-i,0}, \tag{A.30}$$

which when combined with Equation A.29 yields

$$E^{k}_{i,0} = \frac{P_{-}}{(1-2P_{+})^{2}}\left(\left[\frac{P_{+}}{P_{-}}\right]^{k+1} - \left[\frac{P_{+}}{P_{-}}\right]^{k-i+1}\right) + \frac{i}{1-2P_{+}}. \tag{A.31}$$


Recall that the derivation of Equation A.31 is based on the assumption that
$P_{=} = 0$. An expression for $E^{k}_{i,0}$ for the general case where $P_{=} \neq 0$ can be obtained
by noticing that $E^{k}_{i,0}$ in the general case is just Equation A.31 multiplied by the
expected time needed to take an action that changes the distance to the goal.
Because action selections are independent, the time needed to choose a “distance
changing action” is a geometric random variable with mean $1/(1 - P_{=})$. Thus,

$$E^{k}_{i,0} = \frac{1}{1-P_{=}}\left\{\frac{P_{-}}{(1-2P_{+})^{2}}\left(\left[\frac{P_{+}}{P_{-}}\right]^{k+1} - \left[\frac{P_{+}}{P_{-}}\right]^{k-i+1}\right) + \frac{i}{1-2P_{+}}\right\}. \tag{A.32}$$


Equation A.32 is the expected search time needed by a zero-initialized Q-learning
system to first solve a task on a homogeneous $k$-bounded state space
that begins $i$ steps from the goal state. For 1-step Q-learning, the system will
have learned at most the last step along the optimal solution. Q-learning algorithms
that use multi-step estimators may learn more steps along the optimal
path. In fact, it is possible (albeit exceedingly unlikely) that they will correctly
learn all the steps along an optimal solution path in one trial. In any case, a zero-initialized
Q-learner must solve at least the first trial by random search. Thus,
Equation A.32 provides a lower bound on the expected learning time.

□ 
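
As a sanity check on the closed form, the sketch below compares it against an independent first-step analysis of the same 1-dimensional random walk, using exact rational arithmetic. It assumes the bound has the form $\frac{1}{1-P_{=}}\{\frac{P_{-}}{(1-2P_{+})^{2}}([P_{+}/P_{-}]^{k+1} - [P_{+}/P_{-}]^{k-i+1}) + \frac{i}{1-2P_{+}}\}$ and that a distance-increasing action taken at the boundary state $k$ leaves the state unchanged. The function names and the test values are hypothetical and are not taken from the experiments.

```python
from fractions import Fraction


def closed_form(i, k, b_minus, b_plus, b_equal):
    """Expected first-trial search time from the closed form assumed above."""
    p_plus = Fraction(b_plus, b_plus + b_minus)
    p_minus = 1 - p_plus
    p_equal = Fraction(b_equal, b_equal + b_plus + b_minus)
    if p_plus == Fraction(1, 2):
        raise ValueError("the closed form is undefined when P+ = 1/2")
    r = p_plus / p_minus
    walk = (p_minus / (1 - 2 * p_plus) ** 2) * (r ** (k + 1) - r ** (k - i + 1))
    walk += Fraction(i) / (1 - 2 * p_plus)
    return walk / (1 - p_equal)


def first_step_analysis(i, k, b_minus, b_plus, b_equal):
    """The same expectation computed directly from the one-step equations."""
    p_plus = Fraction(b_plus, b_plus + b_minus)
    p_minus = 1 - p_plus
    p_equal = Fraction(b_equal, b_equal + b_plus + b_minus)
    # d[s] = expected number of distance-changing moves to get from s to s - 1.
    d = {k: 1 / p_minus}  # at the boundary, a "+" move leaves the state at k
    for s in range(k - 1, 0, -1):
        d[s] = (1 + p_plus * d[s + 1]) / p_minus
    moves = sum(d[s] for s in range(1, i + 1))
    # Each distance-changing move takes 1 / (1 - P=) steps on average.
    return moves / (1 - p_equal)


# Hypothetical parameters: start 4 steps from the goal, depth bound 6.
args = (4, 6, 3, 1, 2)
print(closed_form(*args) == first_step_analysis(*args))  # True
```

The two computations agree exactly for every parameter choice with $b_{-} > 0$, $1 \le i \le k$, and $P_{+} \neq 1/2$, which is the range in which the closed form is defined.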


Theorem 6: The expected time needed by a zero-initialized BB-LEC system to
learn the actions along an optimal path for a homogeneous state space of size k is
bounded above by

$$\frac{k}{P_{critic}} \cdot |S| \cdot b \tag{A.33}$$

where $P_{critic}$ is the probability that on a given step the external critic provides
feedback, $|S|$ is the total number of states in the state space, $b$ is the branching
factor (or total number of possible actions per state) and $k$ is the depth of the state
space.

Proof: 

The  proof  is  based  on  the  idea  that,  because  the  state  space  is  recurrent,  the  system 
will  eventually  either  solve  the  task  or  receive  feedback  from  the  critic  for  every 
possible state-action pair in the state space. By subsumption, the system will have
learned  (i.e.,  be  appropriately  biased  with  respect  to)  the  actions  along  an  optimal 
solution  path  once  it  has  received  feedback  for  all  possible  state-action  pairs.  Thus, 
a  weak  upper  bound  on  the  expected  time  needed  to  learn  an  optimal  policy  for 
states  along  an  optimal  path  can  be  obtained  by  determining  the  expected  time 
needed  to  receive  feedback  for  every  state-action  pair,  given  that  the  agent  has 
not  learned  the  optimal  actions  along  an  optimal  trajectory  first. 

Let $n_{knw}(t)$ denote the number of state-action pairs for which the system has
received feedback from the critic by time t. We will say that a state-action pair is
known if when the system tried the pair it received feedback from the critic. Notice
that $n_{knw}(0) = 0$, and $n_{knw}$ monotonically increases in time, as the system
receives feedback for new state-action pairs. Since there are at most $|S| \cdot b$
state-action pairs, we would like to know the expected time required for
$n_{knw}(t) = |S| \cdot b$.

This can be obtained by determining the rate at which $n_{knw}$ is expected to
increase. $n_{knw}$ increases at the expected rate of at least $P_{critic}/k$. That is,
on average, the system can expect to receive feedback for an unknown state-action
pair at least every $k/P_{critic}$ steps. This follows since in any sequence of k
consecutive steps the system must either solve the task (i.e., take k correct
actions) or take an action that is non-optimal. But the system will take a
non-optimal step only if the corresponding state-action pair is unknown. Thus,
every k steps the system either solves the task or tries an unknown state-action
pair. Finally, since the critic provides feedback with probability $P_{critic}$, it
takes on average at most $k \cdot 1/P_{critic}$ steps before the system tries an
unknown state-action pair and receives feedback from the critic.
Thus, the expected time for $n_{knw}(t) = |S| \cdot b$ is at most

$$\frac{|S| \cdot b}{P_{critic}/k} \tag{A.34}$$

$$= \frac{k}{P_{critic}} \cdot |S| \cdot b. \tag{A.35}$$
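To give a feel for the size of this bound, the sketch below evaluates it for one set of parameter values. The function name and the numbers in the usage note are hypothetical and chosen only for illustration; they do not come from the experiments in this dissertation.

```python
def lec_learning_time_bound(k, num_states, branching, p_critic=1.0):
    """Upper bound (k / P_critic) * |S| * b on the expected learning time.

    k:          depth of the state space
    num_states: |S|, the total number of states
    branching:  b, the number of actions available per state
    p_critic:   probability the critic gives feedback on a given step
    """
    if not 0.0 < p_critic <= 1.0:
        raise ValueError("p_critic must lie in (0, 1]")
    # At most k / P_critic expected steps are needed per newly known pair.
    steps_per_new_pair = k / p_critic
    # There are at most |S| * b state-action pairs to make known.
    return steps_per_new_pair * num_states * branching
```

For example, `lec_learning_time_bound(10, 100, 4, 0.5)` evaluates to 8000 steps: 20 expected steps per newly known pair, times 400 pairs.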


Theorem 7: If a zero-initialized BB-LEC system has access to an inverse model,
then the expected time needed to learn the actions along an optimal path for a
homogeneous state space is linear in the solution length i, independent of state
space size, and bounded above by the expression


$$\left( \frac{2}{P_{-}(1 - P_{=})} - 1 \right) \cdot i \tag{A.36}$$


Proof:

This  theorem  is  based  on  the  idea  that  with  an  inverse  model  and  an  external 
critic,  the  agent  can  systematically  search  for  the  optimal  action  in  every  state. 
That  is,  it  can  try  an  action,  determine  from  the  critic  if  it  was  incorrect,  and  if 
so  take  the  inverse  and  try  again  or  else  proceed  to  determine  the  next  optimal 
action. 

The  expression  in  Equation  A.36  is  just  i,  the  length  of  the  optimal  solution, 
multiplied  by  the  expected  number  of  steps  required  to  determine  the  optimal 
action  in  a  given  state.  Since  for  zero-initialized  Q-learning,  we  assume  ties  are 
broken  by  selecting  randomly,  the  number  of  action-inverse  cycles  that  must  be 
tried  before  the  optimal  action  is  performed  is  at  most  the  expected  value  of  a 
geometric random variable with parameter $P_{-}(1 - P_{=})$. Since all but the last
cycle require 2 steps, the expected number of steps required to take one optimal
action is less than $\frac{2}{P_{-}(1 - P_{=})} - 1$, and the total number of steps
required is on average less than

$$\left( \frac{2}{P_{-}(1 - P_{=})} - 1 \right) \cdot i. \tag{A.37}$$


The above proof is for $P_{critic} = 1.0$. For $P_{critic} < 1.0$, Equation A.36 must be
multiplied by a factor of $1/P_{critic}$.
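The per-state cost in this argument can be checked numerically. The sketch below sums the series for the expected number of steps when each action-inverse cycle succeeds with probability p (standing in for the geometric parameter used in the proof), failed cycles cost 2 steps, and the final successful cycle costs 1 step. The function names and the truncation length are illustrative assumptions.

```python
def expected_steps_per_optimal_action(p, max_cycles=10_000):
    """Expected steps until the optimal action is taken in one state.

    A cycle succeeds with probability p. Each failed cycle costs 2 steps
    (the action plus its inverse); the successful cycle costs 1 step.
    The infinite series is truncated after max_cycles terms.
    """
    total = 0.0
    for m in range(1, max_cycles + 1):
        prob_success_on_cycle_m = p * (1.0 - p) ** (m - 1)
        steps_used = 2 * (m - 1) + 1
        total += prob_success_on_cycle_m * steps_used
    return total


def inverse_model_bound(i, p, p_critic=1.0):
    """Closed form i * (2/p - 1), scaled by 1/P_critic when P_critic < 1."""
    return i * (2.0 / p - 1.0) / p_critic
```

With p = 0.25 the series converges to 2/0.25 - 1 = 7 steps per state, so a solution of length 10 gives a bound of 70 steps, which grows linearly in i and does not depend on the number of states.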


Theorem 8: A zero-initialized BB-LEC system that aborts a trial and starts
anew if it fails to solve the task after $n_q$ ($n_q > i$) steps has, for a homogeneous
problem solving task, an expected initial solution time that is linear in i,
independent of state space size, and bounded from above by the expression


$$\left( \frac{1}{P_{-}(1 - P_{=})} \right) \cdot n_q \cdot i \tag{A.38}$$


Proof:

This proof is similar to the last and is based on the idea that by aborting a
trial after $n_q$ steps the system can return to the site of previous decisions and
systematically discover the optimal action. While in Theorem 7 the system used
an explicit inverse model, in this theorem the inverse is performed implicitly.

As before, the system must learn the optimal action for the i steps along the
optimal solution path. The expected time needed to learn each optimal step is
just the expected number of times the system must cycle through this inversion
loop multiplied by the expected length of the cycle. In the worst case, the
expected length of the inversion cycle is $n_q$ steps. The expected number of cycles
required for each state along an optimal trajectory is less than the expected value
of a geometric random variable with parameter $P_{-}(1 - P_{=})$. Thus, the expected
number of steps required to learn the steps along an optimal path is less than

$$\left( \frac{1}{P_{-}(1 - P_{=})} \right) \cdot n_q \cdot i. \tag{A.39}$$


□ 

Again, for $P_{critic} < 1.0$, Equation A.38 must be multiplied by a factor of
$1/P_{critic}$.
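The structure of this bound (cycles per state, times cycle length, times solution length) can be written out as a short sketch. Here p again stands for the geometric parameter used in the proof, and the function name and sample values are hypothetical.

```python
def abort_restart_bound(i, n_q, p, p_critic=1.0):
    """Bound on expected steps when trials are aborted after n_q steps.

    i:        length of the optimal solution path
    n_q:      number of steps after which a trial is aborted
    p:        parameter of the geometric random variable (cycles per state)
    p_critic: probability the critic gives feedback on a given step
    """
    if n_q <= i:
        raise ValueError("the theorem assumes n_q > i")
    expected_cycles_per_state = 1.0 / p   # mean of a geometric variable
    worst_case_cycle_length = n_q         # one aborted trial per cycle
    steps = expected_cycles_per_state * worst_case_cycle_length * i
    return steps / p_critic
```

For instance, `abort_restart_bound(10, 20, 0.25)` gives 800 steps, and doubling i while holding the other parameters fixed doubles the bound, which is the linearity claimed in the theorem.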


Theorem 10: The expected time required for a population of naive (zero-initialized)
Q-learning agents using LBW to learn the actions along an optimal path decreases
to the minimum required learning time at a rate that is $\Omega(1/n)$, where n is the
size of the population.

Proof: 

Let $E_n$ denote the expected time required for one of the n zero-initialized
Q-learning agents to solve the task for the first time. Let $P_{n,k}$ be the
probability that one of the n agents first solves the task in the kth round. Recall,
we assume agents operate in parallel. After the kth round each agent has taken k
steps. Thus,

$$E_n = \sum_{k=o}^{\infty} k P_{n,k} \tag{A.40}$$

where o is the length of the optimal solution path.



In  general,  each  agent  behaves  independently  (because  each  is  zero-initialized 
and  each  performs  a  random  walk),  so 

$$P_{n,o} = 1 - \prod_{j=1}^{n} (1 - P_{1,o}) \tag{A.41}$$

$$= 1 - (1 - P_{1,o})^n. \tag{A.42}$$

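Returning to Equation A.40 we have,

$$E_n = o \cdot P_{n,o} + (1 - P_{n,o}) \cdot \sum_{k=o+1}^{\infty} k P_{n,k|\neg o} \tag{A.43}$$

where $P_{n,k|\neg o}$ is the probability that one of the n agents first solves the task
in the kth round, given that they have all failed to do so in the oth round. Now,
for all $n \geq 1$

$$\sum_{k=o+1}^{\infty} k P_{n,k|\neg o} \leq \sum_{k=o+1}^{\infty} k P_{1,k|\neg o} \tag{A.44}$$

so that

$$E_n \leq o \cdot P_{n,o} + (1 - P_{n,o}) \cdot \sum_{k=o+1}^{\infty} k P_{1,k|\neg o}. \tag{A.45}$$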

The sum in the above equation is constant and independent of n; thus, we can
write,

$$E_n \leq o \cdot P_{n,o} + (1 - P_{n,o}) \cdot C_3 \tag{A.46}$$

where $C_3 = \sum_{k=o+1}^{\infty} k P_{1,k|\neg o}$. Substituting Equation A.42 for $P_{n,o}$
allows us to rewrite this as

$$E_n \leq C_1 + C_2 \cdot (1 - P_{1,o})^n \tag{A.47}$$

where $C_1 = o$ and $C_2 = C_3 - o$.

In general, for $0 < x < 1$, $(1 - x)^n$ is $\Omega(1/n)$. Thus, $E_n$ decreases to the
minimum required learning time at a rate that is $\Omega(1/n)$.

Next, recall that $E_n$ is the expected time required for an agent to solve the
task initially. To learn the actions along an optimal path may require the agents
to solve the task multiple times, where each solution involves subsequently shorter
and shorter random walks. In general, the expected solution times for these
additional trials have a form similar to Equation A.47. Thus, in general, the
expected solution time decreases towards o at a rate that is $\Omega(1/n)$.

□ 
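The behavior of the bound $C_1 + C_2 (1 - P_{1,o})^n$ as the population grows can be illustrated with a few lines of code. The constants used in the usage note (o = 10, C_3 = 200, P_{1,o} = 0.05) are hypothetical values picked only to show the trend, and the function name is likewise an assumption.

```python
def lbw_first_solution_bound(n, o, c3, p1o):
    """Bound C_1 + C_2 * (1 - P_{1,o})^n on the expected first-solution time.

    n:   number of agents in the population
    o:   length of the optimal solution path (so C_1 = o)
    c3:  the n-independent constant C_3 from the derivation
    p1o: probability that a single agent solves the task in round o
    """
    c1 = o
    c2 = c3 - o
    return c1 + c2 * (1.0 - p1o) ** n
```

With these sample values a single agent has a bound of 190.5 steps, and the bound falls monotonically toward o = 10 as n increases, since the term $(1 - P_{1,o})^n$ shrinks with every added agent.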

Theorem 11: If a naive 1-step Q-learning agent using LBW and a skilled (optimal)
role model solve identical tasks in parallel and if the naive agent aborts and
restarts the task after failing to solve it in $n_q$ steps, then an upper bound on the
time needed by the naive agent to first solve the task (and learn the actions along
the optimal path) is given by

$$\left\lceil \frac{i \cdot i}{n_q} \right\rceil n_q + i. \tag{A.48}$$


Proof: 

This follows since after $i \cdot i$ steps the naive agent will have watched the role
model solve the task at least i times, the number of times required for 1-step
Q-learning to propagate credit along the optimal path. Thus, after $i \cdot i$ steps
the naive agent will know how to solve the task from the start state. However,
because it may be in the middle of a trial it may perform a total of
$\lceil (i \cdot i)/n_q \rceil n_q$ steps before quitting for the last time and beginning
the final trial. Once the final trial begins, the agent will solve the task in i
steps. Thus, the total number of steps required is at most

$$\left\lceil \frac{i \cdot i}{n_q} \right\rceil n_q + i. \tag{A.49}$$


□ 
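The counting argument in this proof can be traced with a short sketch. It assumes the bound has the form "whole trials needed to cover i watched solutions of i steps each, then one final trial of i steps", which is a reading of the proof text and not a verified transcription of the printed expression. The function name and sample values are hypothetical.

```python
def lbw_role_model_bound(i, n_q):
    """Steps until a naive 1-step Q-learner watching an optimal model is done.

    i:   length of the optimal solution path
    n_q: number of steps after which the naive agent aborts a trial
    Assumption: the role model needs i steps per solution, and 1-step
    Q-learning needs i observed solutions to propagate credit.
    """
    watch_steps = i * i
    # Round up to a whole number of aborted trials (integer ceiling).
    trials_before_final = -(-watch_steps // n_q)
    # The final trial is solved directly in i steps.
    return trials_before_final * n_q + i
```

For example, with i = 5 and n_q = 8, the agent watches for 25 steps, which spans 4 trials of 8 steps, and then solves the task in 5 more steps, for a total of 37.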


